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Abstract 

In recent years there has been an increased interest in statistical analysis of data 
with multiple types of relations among a set of entities. Such multi-relational data can 
be represented as multi-layer graphs where the set of vertices represents the entities and 
multiple types of edges represent the different relations among them. For community 
detection in multi-layer graphs, we consider two random graph models, the multi¬ 
layer stochastic blockmodel (MLSBM) and a model with a restricted parameter space, 
the restricted multi-layer stochastic blockmodel (RMLSBM). We derive consistency 
results for community assignments of the maximum likelihood estimators (MLEs) in 
both models where MLSBM is assumed to be the true model, and either the number 
of nodes or the number of types of edges or both grow. We compare MLEs in the two 
models with other baseline approaches, such as separate modeling of layers, aggregating 
the layers and majority voting. RMLSBM is shown to have advantage over MLSBM 
when either the growth rate of the number of communities is high or the growth rate of 
the average degree of the component graphs in the multi-graph is low. We also derive 
minimax rates of error and sharp thresholds for achieving consistency of community 
detection in both models, which are then used to compare the multi-layer models with 
a baseline model, the aggregate stochastic block model. The simulation studies and 
real data applications confirm the superior performance of the multi-layer approaches 
in comparison to the baseline procedures. 
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1 Introduction 


Over the last decade, relational data has become ubiquitous in all forms of human activities. 
In many applications of statistics and machine learning, one encounters relational data where 
the entities are represented as nodes or vertices and the relations or interactions between 
the entities as edges of a graph. Applications of such graphs or networks include many 
information systems such as social networks, World Wide Web, user information databases 
in e-commerce, metabolic networks, gene regulatory networks, protein-protein interaction 
networks and food web. 

In majority of the cases dealt with in the literature, the relations are assumed to be 
of the same type such as web page linkage, friendship, co-authorship and protein-protein 
interaction. However in modern complex relational databases and networks, we often have 
information regarding relationships of multiple types among the nodes. For example, in 
the context of internet services a set of users may be connected through email, messaging, 
social media, etc., each one of them creating one layer or type of the user-user interaction 
network (Papalexakis et al. 2013). Similarly, users in a social network can have “friendship”, 
“mentions”, “following”, etc. (Greene and Cunningham 2013) or researchers in academia 
may have co-authorship, citations, title/abstract similarity, etc., as different types of relations 
among themselves. In genomics data, cellular components can have different aspects of 
interactions among them, e.g., protein-protein physical interactions and gene co-expressions 
(Narayanan et al. 2010). Such multi-relational data can be represented as multi-layer graphs 
where multiple types of edges represent the relations and the set of vertices/nodes represents 
the entities (Jenatton et al. 2012). 

One of the most important and widely investigated learning goals in an information 
network is clustering the entities on the basis of the relationships between them into densely 
connected subsets called “communities”. From a probabilistic point of view, communities 
can be thought of as groups of vertices which are more likely to be connected to each other 
compared to the rest of the graph, i.e., the probability of having an edge between two vertices 
belonging to the same group is higher than that of having an edge between vertices belonging 
to different communities. Consequently we would observe the number of intra community 
edges to be higher than inter community edges. 

Many researchers have proposed methods and algorithms for community detection in 
networks. Such methods can broadly be divided into three categories: methods based on 
probabilistic models, methods based on the maximization of a global objective function and 
those based on spectral or matrix factorization of the adjacency matrix or the Laplacian 
matrix. The stochastic blockmodel (Holland et al. 1983; Nowicki and Snijders 2001) is a 
statistical model for random graphs with a natural community structure. It is one of a large 
class of statistical models described in the literature for community detection in complex 
networks, which includes the latent variable (Handcock et al. 2007) and latent space models 
(Hoff et al. 2002), the degree corrected blockmodel (Karrer and Newman 2011; Zhao et al. 
2012) and the mixed membership blockmodel (Airoldi et al. 2008). Various likelihood max¬ 
imization based inference strategies have been proposed in the literature to simultaneously 
infer the block assignments and the parameters in the stochastic blockmodel, e.g., profile like¬ 
lihood maximization (Bickel and Chen 2009), maximizing the conditional likelihood (Choi 
et al. 2012), and variational EM under mixture model settings (Daudin et al. 2008). Other 
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strategies involve Bayesian inference using Gibbs sampling or variational methods (Latouche 
et al. 2011) and optimizing a modularity function over all possible partitions of the graph 
(Newman and Girvan 2004). See Goldenberg et al. (2010) for a detailed review of statistical 
inference in networks. 

Several authors have also studied the conditions required on the growth of the number of 
communities and the degree density of networks for the estimation strategies to be consistent. 
Bickel and Chen (2009) and Zhao et al. (2012) studied the conditions for community detection 
through modularity maximization under the stochastic blockmodel and the degree corrected 
stochastic blockmodel respectively. Choi et al. (2012) laid down the conditions necessary for 
the consistency of maximum likelihood estimation under the stochastic blockmodel. This 
work was extended by Rohe et al. (2012) with a regularized estimator to high dimensional 
settings where the number of communities grows roughly as fast as the number of nodes. 
Celisse et al. (2012) derived consistency and Bickel et al. (2013) derived asymptotic normality 
of the maximum likelihood estimators and their variational approximations in the mixture 
model settings. 

In this paper our primary focus is on the problem of detecting an underlying community 
structure in multi-layer networks. We assume that such networks have an implicit commu¬ 
nity structure and different observed layers manifest that underlying structure with varying 
amount of information and noise. As an example of a network where such an assumption 
is reasonable, we analyze a twitter network of British Members of Parliament (see Figure 
1) where the underlying communities are based on their party memberships and the three 
observed layers, “mentions”, “follows” and “re-tweets” manifest that structure in varying 
proportions. In such cases the multi-layer graph is a more accurate representation of the 
underlying similarity of the objects and each layer can provide only “partial” information 
about the data (Rocklin and Pinar 2011). The goal in such cases would be to correctly 
identify the underlying set of communities combining information from all three layers. 



(a) Mention 

Figure 1: A 3-layer twitter network of British MPs. The nodes are colored according to an 
underlying community structure: the party memberships. 

Earlier approaches towards multi-relational data or multi-layer graph clustering suffer 
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from the deficiency that they either cluster each graph independently and combine the results, 
or aggregate the graphs and cluster the aggregated graph. These approaches fail to take into 
account the dependency among the different layers, in particular the correlation among 
different types of edges that share the same pair of nodes. Moreover, the multiple network 
layers can have different characteristics in terms of sparsity and noise. Some layers may be 
dense but may carry little worthwhile information, whereas some layers may be extremely 
sparse but may carry valuable information. The aggregation process of graphs could lose 
the intrinsic heterogeneity of the network layers. Here we attempt to address the problem 
of how to efficiently cluster the nodes or entities in a network taking into account all types 
of layers or relations among them. Several approaches have been recently proposed in the 
literature for this purpose. Among them are approaches based on collective or joint matrix 
factorization (Nickel et al. 2011; Tang et al. 2009; Rocklin and Pinar 2011), non-parametric 
Bayesian models and latent factor models (Jenatton et al. 2012), extensions of spectral 
clustering (Dong et al. 2012) and modularity (Mucha et al. 2010) to multi-layer graphs. 
However there is a lack of statistical analysis of the properties of those methods. 

For community detection in multi-layer networks, we consider a natural extension of 
the standard stochastic blockmodel to multi-layer settings that we will call “multi-layer 
stochastic blockmodel” (MLSBM). This model, also considered in Han et al. (2014) as “multi¬ 
graph SBM”, is in the spirit of multi-relational models described in Holland et al. (1983), 
Taskar et al. (2001) and Kemp et al. (2006). Han et al. (2014) proved the consistency of the 
maximum likelihood estimates (MLEs) in this model when the number of relations grows. 
They keep the number of nodes (and hence the number of communities) fixed. However, 
as we will see later in both the asymptotic analysis and simulation studies that MLE in 
this model does not perform very well when either the number of communities grows fast 
or the network layers are sparse on average. Hence, we propose a restricted version of this 
model through restrictions on the parameter space which is capable of handling networks 
with a large number of communities. We call this model “restricted multi-layer stochastic 
blockmodel” (RMLSBM). We derive conditions on the growth of the number of communities 
and the average edge density of the networks under which the MLE of the class assignment 
vector is consistent (in the sense that the proportion of misclassihed nodes tends to 0 as 
the number of nodes, and possibly the number of relations as well, grows). We further 
derive the minimax rates of error for community detection in MLSBM and obtain thresholds 
for consistent community detection. To compute the unknown class assignments and block 
model parameters simultaneously, we follow Daudin et al. (2008) and propose a variational 
estimation strategy. 

The rest of the paper is organized as follows. Section 2 extends the stochastic blockmodel 
to multi-layer settings and defines the two models, MLSBM and RMLSBM. Section 3 settles 
the consistency of the community assignments through maximum likelihood estimation in 
the two models when the true data generating model is MLSBM. Section 4 describes a 
few baseline procedures and Section 5 compares the multi-layer models with the baseline 
models in terms of minimax error rate and sharp threshold results. Section 6 describes two 
estimation strategies for the MLEs in the two models. Section 7 describes the results of a 
simulation study to validate the theoretical results. Section 8 presents the application of the 
methods to the Twitter UK politics data set. Section 9 gives concluding remarks. 
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2 Extension of blockmodels to multi-layer settings 


We consider an undirected multi-layer graph G = {V, E}, where the vertex set V consists of 
N vertices and the edge set E consists of edges of M different types representing different 
relations. We can view the multi-graph as a graph with vector valued edge information, 
i.e., the adjacency matrix A consists of elements A lv who are themselves M dimensional 
vectors: A %] = {A^\ A^\ ..., A^ 1 ' 1 }. An alternative way to approach the problem is to 
view the multi-graph as a collection of M, N x N adjacency matrices {A^, A^ 2 \ ..., A( m )} 5 
each corresponding to one particular type of relation. The rest of the set up is similar to the 
regular stochastic block model (SBM) for one-layer case with K blocks (Nowicki and Snijders 
2001). We assume the number of communities K is known. Let z = {zi, Z 2 , ■ ■ ■, zn} be the 
community indicator vector for the N nodes, such that each Zi takes exactly one value from 
the set {1,..., K} and z t — q if and only if node i belongs to community q. Conditional on 
the community indicator vector z, the edges are formed independently as Bernoulli random 
variables with probabilities depending only on the community assignments and the type of 
edges. In what follows we describe the two extensions of the standard SBM to multi-layer 
settings. 

Except for the estimation algorithm, the model is always represented as a conditional 
block model and z is assumed to be a fixed unknown parameter of the model and needs to 
be estimated from data. Conditioned on the community assignments of the nodes Zi and Zj, 
the edges are formed independently following Bernoulli distribution 

A[" l) \(zi = q,Zj = l) ~ Bernoulli(P^). 


The first model assigns a separate probability for the mth type of edge between nodes 
belonging to the qth and the Zth community independent of all other edges. We call this 
model the “multi-layer stochastic blockmodel” (MLSBM). The probability of an mth type 
of edge between nodes i and j belonging to communities q and l respectively can be written 
as 


PW = 7T W = 7T 


V 


ZiZj 


( m ) 
ql > 


i,j e N}, m e M}, q,l e {!,..., Ji}. 


The set of parameters for the model, n = q < l, q, l G {1,..., K}, m G {1,..., M}} 

has K(K + l)M/2 elements. This model is “saturated” in the sense that we have a different 
parameter for each of the different types of edges between nodes belonging to different 
communities. Denote the range of this parameter set or array as II = {m G [0, 

In our asymptotic settings, where both N and M grow and K grows with N, the number 
of parameters to be estimated in the MLSBM grows as K 2 M and quickly becomes large. 
Hence the MLE performs poorly especially when the individual network layers are sparse. 
This problem does not arise in the asymptotic settings of Han et al. (2014) where only M 
grows and N, K remain fixed. However, it has been empirically shown that in most real 
world networks the average cluster size does not grow with the size of the network (Leskovec 
et al. 2008; Rohe et al. 2012; Binkiewicz 2015) and consequently, K grows with N. Hence in 
our asymptotic settings where N grows, keeping K fixed would be rather unrealistic. This 
motivates us to propose the second related model whose number of parameters grows much 
slowly compared to MLSBM. 
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The second model assumes the probability of the mth type of edge appearing between 
nodes i and j is governed by two factors: the first one being the community assignment of 
the two nodes and the second one being the type of edge. Hence the model has two sets of 
parameters: a K x K parameter matrix itkxk corresponding to the community structure, 
and an M x 1 vector (3mx i which contains the parameters for different types of edges. We 
call this model the restricted multi-layer stochastic blockmodel (RMLSBM). 

Notice that in the second model, if the edges were all of the same type, we would just have 
/3 m = P for all m G {1,..., M} and then we will recover the standard stochastic blockmodel, 
with probabilities of edges determined solely by the community assignments. On the other 
hand, if we did not have a community structure, but M types of edges, then n q i would be 
identical for all communities q, l and the probability of an edge between nodes i and j will 
solely be determined by the type of edge. This model can retrieve information from sparse 
but highly informative edge types as the sparsity of the network layers will be captured in 
the [3 rn parameters. Hence, although we assume the edges to be conditionally independent, 
this model induces two types of correlations unconditionally — among the edges of the same 
type and among the edges that share nodes of the same community. 

The probability Pj q n in RMLSBM , which denotes the probability of an mth type of edge 
between nodes i and j belonging to communities q and l respectively, can be modeled in the 
following way with the logit link function 

logit(T^) =7 T ql + f3 m , i,j G {1 me {1,...,M}, q,l G 

This model has K(K + l)/2 + M parameters for an undirected graph. Hence, when both K 
and M grow, the growth rate in the number of parameters for this model is the same as the 
maximum of the growth rates in K 2 and M. In comparison, the number of parameters in 
MLSBM would grow as K 2 M. This makes the maximum likelihood estimator in RMLSBM 
a regularized estimator. 

For the RMLSBM to be identifiable, we require the parameters j3 m to satisfy the condition 
J2m./3m — 0. Hence we have one less free parameter. Denote the set of parameters for 
RMLSBM as ir R = {{ji q ul3 m ) : q < l, q, l G {1,..., K }, m G {1,..., M}} and its range as 
n 11 ’ = {n R G 7 £A(A+i)/ 2 +M^ Y2 m /3 m — 0}. To prove the consistency of maximum likelihood 
estimation under MLSBM, we assume n q u(3 m G (—C log(M!V 2 ), C\og(MN 2 )) for some 
constant C > 0. This condition ensures that n q i and f3 m are bounded away from ± 00 . 


3 Consistency 


In this section, we discuss the consistency of maximum likelihood estimation of the proposed 
models under three asymptotic regimes with varying conditions imposed on the growth of 
the number of communities (K) and the expected total number of edges of the multi-layer 
graph (L). We first define a one to one transformation of the parameters of RMLSBM as 


05r )=i °g it v^+An) 


exp(7T g ; + f3 m ) 

1 + exp ( 71 ^ + / 3 m )' 


(3.1) 
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Now we assume that the data are generated from the more general model MLSBM and view 
RMLSBM as a MLSBM with the following restrictions on the parameters: 

$ = {0 G [0,1 ] A(A+1)M/2 : <; t = logit _ 1 (7 T qi + (3 m ), (3.2) 

vr q i,p m e (-Clog(MN 2 ), Clog(MN 2 ))}. 

This way the MLE in RMLSBM can be thought of as a restricted MLE (RMLE) of MLSBM. 

Our aim is to investigate the consistency of both the MLE and the RMLE under three 
asymptotic regimes where we let either the number of nodes (N) or the number of types 
of edges (M) or both to grow. This setup is quite appropriate for modern day multi¬ 
layer networks, where data collection increases both in terms of new entities as well as new 
features or layers getting added to the database. Consequently methods are being sought 
which would be consistent in such situations. Some consistency results for the MLE were 
obtained in Han et al. (2014) under the settings when M grows, but N and consequently K 
remain fixed. Here we prove consistency results for the MLE in the more general asymptotic 
setting where N can also grow (and K grows with N). We then compare the MLE with the 
regularized estimator in terms of the asymptotic conditions required for consistency. The 
different asymptotic setups we consider under the three regimes of growth in N and M are 
described below. 

1. As both M and N grow, let K = 0(N 1 ^ 2 ) and L = uj(MN( log N) 3+s ) for some S > 0 
for the MLE, while K = 0((MNy/ 2 ~ e ) and L = u(M N (log N) 3+s ) with e, 5 > 0 for 
the RMLE. For the RMLE, we further require that M = 0(N ) so that K does not 
exceed N. 

2. As N grows, M either is fixed or grows slower than N, i.e., either M is 0(1), or 
M —> oo and M = O(N). In this regime, let K = 0(N 1//2 ), L = l>j(N (log N) 3+5 ) for 
some 5 > 0 for the RMLE. 

3. As both N —y oo and M —> oo with M growing faster than N, i.e., M = oj(N), for 
RMLE we consider two related setups: (a) K = 0( l ^ N ), L = l>j(MN( logN) 1+s ) 
for some 5 > 0; and (b) K = 0(N 1 ^ 2 ), L is either w(M(logM) 2+(S (log N) 1+s ) for some 
5 > 0 if (log M) 2+5 = O(N), or cu(MN(log N) 1+s ) for some 8 > 0 otherwise. In 
setting (a), we further require log M to grow slower than N for the growth of K to be 
meaningful. Also, in that setup if log M grows at the same rate as (log AQ^ for some 
j3 > 0 , the number of communities grows almost as fast as the number of nodes except 
for the log terms and is “highest dimensional” in the sense of Rohe et al. (2012). 

Note that the first regime assumes no relation between the growth rates of N and M, 
while the next two regimes assume certain relations between the two growth rates. So the 
last two regimes can be thought of as special cases of the first one in terms of the growth 
rates of N and M. Naturally we expect some relaxation in the required growth conditions 
on K and L in the last two regimes. The asymptotic setups described above reflect this 
relaxation for the RMLE. However no such relaxation is possible for the MLE. Hence we will 
prove that MLE in MLSBM is consistent under the first asymptotic regime, whereas MLE 
in RMLSBM (i.e., the RMLE of MLSBM under the restrictions defined by Equation (3.2) 
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is consistent under all three asymptotic regimes. The MLSBM, despite being intuitively the 
simplest extension, does not perform as well as the RMLSBM for community detection in 
multi-relational networks if the networks are sparse at an average or contain a large number 
of communities. 


3.1 Preliminaries 


Since in this paper our primary interest is in modeling multi-layer networks where layers are 
sparse on an average, we require the true MLSBM model probabilities 7 to satisfy certain 
sparsity conditions. As Zhao et al. (2012) pointed out, if the block model probabilities remain 
fixed as N increases, then the network will be unrealistically dense. In this connection it 
is worth noting that Snijders and Nowicki (1997) let the probabilities remain fixed and as 
a result the networks considered there have linearly increasing average degree, while both 
Bickel and Chen (2009) and Choi et al. (2012) considered networks with poly-logarithmically 
increasing average degree and hence gradually decaying probabilities. Here to keep the 
network sparse, we scale down the block model probabilities accordingly as N increases. 

We introduce a new notation L' to denote the quantity inside the asymptotic notation 
uj in the growth rate of L under different asymptotic setups. As an example, consider the 
case when L = u(MN(log N) 3+s ), then L' = MN(log N) 3+5 . Hence L' can be viewed as 
the minimum rate at which L is required to grow under a particular asymptotic setup. The 
blockmodel parameters are restricted to have an upper bound that decreases with increasing 
N except for a small finite set indexed by the triplet Q = such that the expected 

number of edges in the set \Eq\ = o (yyyiVyny) • For the set Q we can have < 7T^™' 1 < 
1 — . For all {g, Z, m} ^ Q, the parameters are restricted in the following way 


(m) 

Ki G 


c 


u 


\ MN 2 ’ MN 2 (log M log N)‘ 2+5 


(3.3) 


for some 5 > 0 and some constant C, so that the upper bound is determined by the expected 
density of the network. The exact upper bound is determined by L' and consequently, by 
the growth rate of L and varies under the different asymptotic assumptions. 

For any arbitrary partition z of the entities in the graph, the log likelihood of the set of 
M adjacency matrices A = ..., A under the MLSBM with parameters 7 r = 

is 

M 

l(A ; A £{4" )lo g aS + (1 - 4“ } )log (1 - vri^j)}. (3.4) 

m =1 i<j 

Note that for an undirected graph with no self-loops, both A (m ) and 7r( m ), m — 1,..., M, 
are symmetric matrices in {0, l). 7Vx7V and [0, l] A:xA respectively. The Bernoulli parameters 
7r,i™] depend both on the class assignment z and the type of relation m. For a fixed class 
assignment z, let N q denote the number of nodes assigned to class g, and n q i denote the 
maximum number of possible edges between classes q and l. So we have n q i = N q Ni and 
n qq = Co 1 ) ■ F° r an arbitrary partition z, the MLE of is 


= — TWrm = q,Zj = 0, m = q,l = 1,..., A', 


~ (m) 
7T; y, 
(z)ql 


(3.5) 







where 1{-} is the indicator function. Note that for a fixed partition z, the denominator n q i 
in the MLE 7 is the same for all edge types m. 

Now we define the expectation of 7f( 2 ) as 7 f( 2 ) and that of l (A; z, 7 r) as lp{z, n) under the 
independent Bernoulli^ 7 "' 1 ) model. Then we have 

^) q i = ~Y = Q, z j = m = 1,..., Af, g, / = 1,..., AT, (3.6) 


M 

i P (z, it ) = Y Y^r )log 33 + _ 4 m) ) lo § ~ 33 )}- ( 3 - 7 ) 

ra=l i<j 

Clearly for a given z, 7T( 2 ) and 7f( 2 ) are the maximizers of the functions l (A ; z, 7 r) and lp(z, 7 r) 
respectively, and we let l(A\ z) and Zp(z) denote the corresponding maximum values. 

We extend Lemma 1 of Choi et al. (2012) to multi-layer settings as follows: 


l(A-z)-lp(z)=J2J2 

m i<j 


A .log 


0 


A ) log 


^££ n q iD( 

m q<l 



Am) I, (m) 
7T (z)ql II n (z)ql 



where 


M 


X = 


Y Y 33 io § 


- (m) 

7rL;3 


m= 1 i<j 


1 - 7T (m) 
1 71 


(3.9) 


Here D(a\\b) is the Kullback-Liebler divergence between two Bernoulli random variables 
with parameters a and b respectively. This equation decomposes the difference between 
the maximized likelihood and its expected value in terms of 7f( z ) and 7 f(~) for a given class 
assignment vector z. 

Next we turn our attention to RMLSBM. As mentioned before, we consider RMLSBM 
as a restricted version of MLSBM, and the MLE of RMLSBM can be viewed as a RMLE of 
MLSBM under the restrictions. Given a class assignment z, the RMLE 7rl” 2 ^ = {7 p z )qh P{z)m} 
is the maximizer of l R (A) z, tt r ) , the multi-layer block model log likelihood within the re¬ 
stricted parameter space. Substituting the estimated parameters in the likelihood func¬ 
tion gives l R (A]z), the maximum of the likelihood function within the restricted parameter 
space. However, no closed form solution exists for the RMLE. Instead we have the following 
M + K(K + l)/2 estimating equations: 


d 


d(3 n 


:= £ A™ 


(m) exp(7T ZiZj + p m ) 


i<j 


d 


dn v,. 

Z'X '-'J 


£ £ 4 m) - 


1 + exp (7r z . z . + p m )) ’ 
(m) exp(7t 2 . 2 . + p m ) 


i<j m 


1 + exp(7r 2i2 . + i3 m ) 


(3.10) 


(3.11) 


One of the equations is redundant since if we add the equations in (3.10), the resulting 
equation is identical to the sum of the equations in (3.11). 
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Now we use the transformation defined by 0 in Equation (3.1). The likelihood with 
respect to the new parameters can be represented as 


M 


f%W) = ££{4 

m =1 i<j 


(m) 


log 4™} + (1 


(m) 


A y 

A ij 


)io g (i-4”’)}, 


(3.12) 


and the estimating equations in (3.10) and (3.11) can be written as 


1 

N(N + l)/2 




i 

N(N + l)/2 
1 

1V(1V + l)/2 


A ij l)l i Z i = 9, Zj = 1} 

q<l i<j 


£4”’. rn = l,...,M, 


(3.13) 


y H 41 = A y, , £ £ 4” ll { 2 < = 9. % = 0. 9 < i 6 {1. ■... K}. (3.14) 

m q m i<j 

Together the right hand sides of these equations are the complete and sufficient statistics for 
the model, ffence we have K(K + l)/2 + M—1 independent equations which will together 
determine the MLE of K(K + l)/2 + M — 1 free parameters in the set n R y Here it is 
understood that the estimation procedure ensures that the ffiiiteness condition of n q i and 
f3 m are respected possibly by restricting ir q i,/3 m G (—CTog(MfV 2 ), C log(MN 2 )). By the 

functional invariance property of the MLE, j s the mle of 0,"^,. Note 

1 v J ’ Y {z)ql l+exp (it q i+/3m) Y (z)ql 

that the minimum value any 0|™^ can take due to the imposed boundedness constraint is 

1/MN 2 . This value is sufficiently small so that none of the partial sums in the left hand side 

of Equations (3.13) and (3.14) exceeds 1. 

As before we define expectations of 0 2 as 0 2 and that of l R (A] z, 0) as lp(z, 0) under the 
independent Bernoulli(P/ J m ' ) ) model. Then, 


M 

(?(*, 0 ) = £ £{4”‘ ) M4S) + (i - df’) Mi - (a- 15 ) 

m= 1 i<j 


For a given class assignment z, 0 2 and 0 2 are the maximizers of the functions l R (A]z,(p) 
and Tp(z, 0) respectively, and we let l R (A ; z) and l R (z) denote the corresponding maximum 
values. The difference between the maximized values of the observed and expected likelihood 
can be decomposed in two parts similar to Equation (3.8) as follows 


l R (A-, z) - l R P (z) = £ £ n„D (0”>, II $”>,) + A - E( A), 

m q<l 


where as before, 


M 


A^££A< r , ,og 

m =1 i<j 


■'Zi Zj 


Z^ Zj 


(3.16) 


(3.17) 
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A proof of this result can be found in the Appendix. Since the maximum of unrestricted 
likelihood would be at least as large as the maximum of restricted likelihood, we have 
l(A; z) > l R (A; z) and lp(z) > l R (z) for all z. 

Now let z denote the true partition. Further let z and z R denote the MLEs of z under 
the two models MLSBM and RMLSBM respectively, i.e., 

5 = argmaxZ(A,z). (3.18) 

Z 

z R = argmax/' R (A, z). (3.19) 

Z 


3.2 Main results 


We give several theorems in this section as we develop towards our main result. These 
theorems provide insights into the conditions required under the three asymptotic regimes 
discussed in the beginning of Section 3, which in turn provide comparison between the 
asymptotic behavior of MLEs in the two models MLSBM and RMLSBM. All the proofs are 
given in the Appendix. 

The first three theorems bound the difference in the maximized log likelihood and its 
expected value for both MLSBM and RMLSBM as defined in Equations (3.8) and (3.16). 


Theorem 1. Suppose a MLSBM and a RMLSBM, both with K classes and M layers, are 
fitted to the graph with adjacency matrix = {A^\ ..., A^}^, i,j = 1 

where are independent Bernoulli{P,trials. For any class assignment z, suppose 

the estimate 7T( 2 ) = q,l G {1 , ...,AT}, m G {1 ,...,M}} maximizes the multi¬ 

layer block model likelihood 1 (A]z,tt) and the estimate it R ^ = {{fi( z )qh (3( z )m)] q < l, ?,I G 
{1,..., A"}, m G {1,..., M}} maximizes the likelihood from the model with the restricted 
parameter space defined by iR’. Let q, l G {1,..., K}, m G {1,..., M}} be 

defined from tt t F according to Equation (3.1). Then for any e > 0, 



< exp 


(n log K + M(K 2 + K) log 



(3.20) 


P 


t max< > 

2 Itr 


N(N + 1) / S q<l n ql^{z)ql q<l 7l ql ( l ) {z)ql 

2 ^ N(N + 1)/2 N{N + l)/2 


> e 


f o / N M 1/2 

< exp f N log K + (A' 2 + K) log f ——-f 1 


+ M log 


N(N + 1) 


(3.21) 


+ 1 



< exp ( At log A' + (A' 2 + K ) log 


/NM 1/2 

I” 


+ 1 + M log 


N(N + 1) 


+ 1 


(3.22) 
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The first result (3.20) provides a bound for the first part of the right hand side of Equation 
(3.8) for MLSBM. The results (3.21) and (3.22) provide a bound that will be used in Theorem 
3 to bound the first part of the corresponding likelihood decomposition for RMLSBM in 
Equation (3.16). In the proofs of the next two theorems, we first bound the second part 
of Equations (3.8) and (3.16), and then combine the results to provide a bound for the 
difference between the log likelihood and its expected value under any arbitrary partition z 
for MLSBM and RMLSBM respectively. 

Theorem 2. Suppose a MLSBM with K classes and M layers is fitted to the graph whose 
edges A^ are independent Bernoulli(P^) trials. If we further assume that (i) < 

P^ r,) < 1 — jAji for all i < j, (ii) K = O^N 1 ^ 2 ), and (Hi) the total expected number of edges 
of the entire multi-layer graph L = is uj(M N (log N) 3+s ) for some 5 > 0 as 

m i<j 

both M and N grow, then 

max|Z(v4; z ) — lp(z) \ = op(L). 

Z 


The result of this theorem holds under the given conditions irrespective of the relationship 
between the growth rates of M and N. We state the result under the first asymptotic regime 
mentioned at the beginning of Section 3 since we do not get any relaxation in the assumption 
regarding the total expected number of edges if we assume certain relations between the 
growth rates of M and N. 

The next theorem states that the restricted likelihood in RMLSBM is also asymptotically 
well behaved under five independent sets of conditions corresponding to the three asymptotic 
regimes discussed at the beginning of Section 3. The first two sets of conditions correspond 
to regime 1, the third set of conditions corresponds to regime 2, and the last two sets of 
conditions correspond to regime 3. 

Theorem 3. Assume that a RMLSBM with K classes and M layers is fitted to the graph 
whose edges A^ are independent Bernoulli(P-™' 1 ) trials. If we further assume any of the 
following five sets of conditions with respect to the growth of the properties of the model under 
different asymptotic settings: 

(i) both M and N grow, K = 0(N 1 / 2 ), < pW < C N{l ^ )2+S for all i < j, 

where C is a constant, and the total expected number of edges of the entire multi-layer graph 
L = cu(MiV(logiV) 3+<5 ) for some 5 > 0; 

(ii) both M and N grow but M = O(N), K = 0((MN) l t 2 ~ e ) for some e > 0, < 

P^l < C j V ( 1 pg S ^ 2 +<s f or i < j, where C is a constant, and the total expected number of 
edges of the entire multi-layer graph L = u>(MN(log N) 3+s ) for some S > 0; 

(in) M is either a constant or grows slower than N, i.e., M = o(N), K = 0(N 1 / 2 ), 
jA -2 < P 1 )™' 1 < C mn(IoJIi) 2 + s f or a ^ * < J> where C is a constant, and the total expected 
number of edges of the entire multi-layer graph L is o;(iV(log N) 3+s ) for some 5 > 0; 

(iv) M grows and N is either a constant or grows slower than M, i.e., M = uj{N), 

K = 0 ( logAUog m )’ WP - p ij n) - C n logAyiog m) 2+s f or aU 1 < 3, where C is a constant, and 
the total expected number of edges of the entire multi-layer graph L = u(MN (log N) 1+6 ) for 
some 5 > 0; 


12 













(v) M grows and N is either a constant or grows slower than M, i.e., M = oj(N), 

K = 0(N l > 2 ), < min » C n log v(io g a iy^ ) f or aU i < 3, where C is a 

constant, and the total expected number of edges of the entire multi-layer graph L is larger 
than the the smaller of M (log M ) 2+5 (log N) 1+s and MN(log N) 1+5 for some 8 > 0,- 
then, 

m&x\l R (A; z) — lp(z )| = op(L). 

Z 


It is clear from Theorem 2 and Theorem 3 that in RMLSBM, the bound on the likelihood 
can be established both for relatively milder conditions on the expected total number of 
edges and relatively faster growth conditions on the number of communities. As we will see 
in Theorem 5 and the discussion following it, this enables RMLSBM to be a more attractive 
model for community detection either when the number of communities is large or when we 
have relatively sparser graphs. 

Now we are ready to state our main results which show that when the true data generating 
process is a K -class MLSBM, the fraction of nodes misclustered by the MLEs and the RMLEs 
converge to zero under different asymptotic regimes. We define the number of “misclustered” 
nodes N e (z) as the number of incorrect class assignments under z, counted for every node 
whose true class under z is not in the majority within its estimated class under z (Choi et al. 
2012 ). 

The previous results (Theorems 1, 2, 3) hold for any Pj" 1 ' 1 whenever they are bounded 
as described in the theorems. Now we assume further structure on the probabilities, namely 
a MLSBM. Denote the true partition as z, and under the true partition, let the true block 
model parameter array be 7f. Hence, under MLSBM we have 

pM _ =H 

ij n ZiZj- 

Consequently, lp(z, 7r) from Equation (3.7) is maximized by the true model parameter 7 f, 
and we have the maximized expected likelihood as 

M 

tp(z) = log + (! - ^r } ) m 1 - ( 3 - 23 ) 

m =1 q<l 

On the other hand, the expected restricted likelihood is maximized by the parameter 
array tt r under the restricted parameter space of RMLSBM. Note that this is different from 
the true model parameter array 7 f due to the restrictions imposed on the parameter space. 
Using the transformation introduced in Equation (3.1), the maximized expected restricted 
likelihood is 


M 


l R P {z) = £ E^logfe! + (1 - Plf) Ml - Cg)} 

m= 1 i<j 
M 

= l °g( 1 - &,)} 

m— 1 i<j 
M 

= X) n 9 , (^ )log ^ ) + - ^r } ) M 1 - )}■ 

m=l q<l 


(3.24) 
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The next theorem relates the difference between observed and true likelihood with the 
fraction of misclustered nodes N e (z) and the expected total number of edges L to establish 
a bound for the misclustering rate. 


Theorem 4. Suppose the data are generated according to a K-class MLSBM with member¬ 
ship vector z and parameter array rt, the conclusion of Theorem 2 holds, and the following 
conditions hold with respect to the model sequence: for all blockmodel classes q = 1 ,... ,K, 

class size N q grows as s = min{AL} = Ll(N/K), and over all distinct class pairs ( q,l ) and 

<? 

all classes c {q, l}, 


mm mm max 

q,l m c 


D 


7T 


(m) 

qc 




7T, 


qc 


+ 7T 


(m)' 
Ic 


= n (^-) 

\MN 2 ) 


+ D 


4 m> 


7T, 


M i 


qc 


+ 7T, 


Ic 


(3.25) 


then 


N e (z) = o P (N). 


(3.26) 


Note that condition (3.25) is very similar to condition (ii) of Theorem 3 in Choi et al. 
(2012) with the total number of edges for the single layer case being replaced by the average 
number of edges L/M in each layer for the multi-graph. This ensures that any two rows in 
any of the layer matrices 7of 7f differ in at least one entry by at least a constant times 
jf^ 2 - Also, when we take into account the asymptotic conditions required on the growth of 
K and L for the result of Theorem 2 to hold, i.e., K = 0(N 1 ^ 2 ) and L = u>(MN(\og N) 3+s ) 

with M and N both growing, then we have = w ( ^°^i/ 2 —As argued in Choi et al. 

(2012), if L is close to its least possible rate of growth, -AA, goes to 0 for large N and 
the condition is not too prohibitive. For example, if L — MNfiogN) 13 with (3 > 4, then 
(logA r Y — o(N 1 / 2 ), so jfpn § oes to 0 and the condition is not overly restrictive. 

We state the corresponding conclusion for the restricted likelihood estimation (for RMLSBM) 
in the next theorem, i.e., the class membership assignment vector estimated through the 
maximum likelihood estimation in the restricted model RMLSBM is consistent under data 
generated from the MLSBM. 


Theorem 5. Suppose the data are generated according to a K-class MLSBM with member¬ 
ship vector z and parameter array il, the conclusion of Lemma 3 holds, and the following 
conditions hold with respect to the model sequence: for all blockmodel classes q = 1,..., K, 
class size N q grows as s = min{N q } = Q(N/K), and over all distinct class pairs ( q,l ) and 

Q 

all classes c ^ {q, l}, 


mm mm max 

q,l m c 


D 


IT 


M 

qc 


(m) 


7T, 


qc 


+ 7r 


(m)' 
Ic 


+ D 7T 


(m) 


Ic 


_ (m) — (m) 


7T. 


qc 


+ 7r 


Ic 


= n{g), (3.27) 


then under any of the five sets of growth conditions in Theorem 3, we have 


N e (z R ) = o P (h). 


(3.28) 
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Here g in condition (3.27) and the growth rate h depend on the asymptotic conditions imposed 
on K and L. The growth rate h can be determined from g by the relationship h — hL 


MNg ' 


In particular, (i) when K = 0(N 1 ^ 2 ), L = co(M N (log N) 3+s ) with M and N both growing 
arbitrarily, then we have g = = a; ^ lo ^y 2 3+ j and h = N; (ii) when K = 0((MN) l ^ 2 ~ e ), 

L = cu (M1V (log-/V) 3+<5 ) with M and N both growing so that M = O(N), then we have 
g = jfjfj = and h = N; (in) when K = 0(N 1 ^ 2 ), L = uj(N( log N) 3+5 ) and M = 

o(N ), then we have g = = a; ^ lo ^ > 2 + j and h = N/M; (iv) when K = 0(N 1 ~ e / log M), 

L = iv(M N (log N) 1+5 and M = u(N), then we have g = 


LK 


mn = UJ {ldfM) andh = N; (v) 
whenK = 0(N 1 / 2 ), L is u(MN(log N) 1+s ) if N < (log M) 2+5 or uo(M(log M) 2+s (log N) 1+s ) 
if N > (log M) 2+s and M = u(N), then we have g = j or g = = 

andh = N. 


Note that in Theorem 5, we have used generic notations g and h to denote functions of the 
network properties such as N, K and L. The functions g and h vary across asymptotic setups. 
This is so because the regularity condition (3.27) on the difference among the elements of 
block model probability matrices should be as less prohibitive as possible. Note that in our 
results, we have chosen g in such a way that if L is close to its least possible rate of growth, 
then g asymptotically decays to 0 under the assumed asymptotic setup. This ensures that 
our condition (3.27) is not overly restrictive. It also enables us to understand and contrast 
the asymptotic behavior of the RMLE from a unified point of view. 


3.3 Sparse networks 

The results of all previous theorems imply that for sparse multi-layer networks, consistency 
can be achieved with a large number of relatively sparser graphs as long as they together 
satisfy the edge density requirement. In the case when M grows slower than N, in MLSBM 
we do not get any relaxation in the required growth condition on the total expected number of 
edges from all the graph layers combined, and it remains cv(MN(log N) 3+s ) for K = 0(N 1 / 2 ). 
However in RMLSBM we only require the total expected number of edges from all layers to 
be cv(N(log N) 3+s ) for K = 0(N 1 ^ 2 ) (Condition (iii) of Theorem 3). This implies that we 
only require the expected number of edges per layer to be o;(iV(logiV) 3+ ' 5 /M) on average. 
For perspective, if M grows faster than (log N) 3+s , then the average number of edges per 
layer needs to grow only at O(N), which is the sparse bounded degree regime. This case is 
extremely challenging for single layer networks. In comparison, the consistency of the MLE 
in MLSBM requires the average expected number of edges per layer to be a;(iV(logiV) 3+ ' 5 ) 
(Choi et al. 2012) and hence the average degree per layer must grow at least as (loglV) 3+<5 
. Thus consistency can be achieved with a large number of relatively sparse layers. This is 
particularly important as most modern applications of community detection in multi-layer 
graph fall under this asymptotic scenario. 
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3.4 A Large number of communities 

Under MLSBM, consistent community detection is possible when the number of communities 
grows as K = 0{N 1 ^ 2 ) and the total expected number of edges is u(MN(log N) 3+s ) as both 
M and N grow. However, if we assume K = 0((MiV) 1//2_e ) for some e > 0, then we require 
the total expected number of edges to be co(M 2 N(log N) 3+s ) which is unrealistically dense. 
On the other hand, under RMLSBM consistent estimation is possible with comparable edge 
density even when the number of communities grows faster, either as K = 0((MNyP~ e ) 
when both M and N grow but M = O(N), or as K = 0( 1 qo . M\ ogN ) when N grows slower 
than M (Conditions (ii) and (iv) of Theorem 3). Hence the restricted model is advantageous 
for community detection in networks with a large number of communities. 

4 Baseline procedures 

We define three intuitively simple baseline procedures for community detection in multi-layer 
networks. The first two are based on aggregating the layers of the graph and the third one 
is an ensemble of results from single layer community detection through majority voting. 

The first aggregate procedure, which we call “agg-mean” creates a binary network on the 
nodes by adding an edge between two nodes if they are connected in more than half of the 
layers. Hence an edge between two nodes, A“? 9_mean is a Bernoulli random variable with 
probability 

pagg-mean = p^ A. ™ > M/2). (4.1) 

m 

However, this method of collapsing a multi-layer graph into a single layer graph is not 
very useful for the sparse graph regimes we are interested in, because the probability that 
Em 4f > 1 asymptotically vanishes. This can be seen as follows: the random variable 
a sum °f M Bernoulli random variables with different probabilities P-j"' 1 . Hence 
V A-” 1 '* follows a Poisson-binomial distribution and 

4—jm ij 

P(E 4"’ > 1) = 1 -{p£ 4 m) = °) + p C£ 4"’ =!)} 

m mm 

= i - tR 1 - 4 m> ) + E 4 m) n t 1 - 4 l) » -*■ °. 

m m k^m 

if —>■ 0 as N —> oo with M remaining fixed. Hence the new graph created by this 

procedure will have asymptotically few edges. 

A more appropriate aggregate measure is to create a network by adding edges if A •™' 1 > 
0. We call this procedure “agg-sparse”. Note that in this case the edge between two nodes 
^agg-spcirst - g a B ernou jp random variable with probability 

pagg-sparse = p( ^ > 0 ) = 1 - = 0 ) = 1 - ]J(1 - P™) 

m mm 

xl-exp< 4 -2) 
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since P^ —> 0 as N —y oo. Clearly this network is also generated by a SBM with the 
same community assignment vector as the original multi-layer network. The probability of 
an edge, given the block assignments, can also be written in terms of those of the original 
network as 

p^,-,^\ {Zi = qZj = 

m 

Hence from known results on single layer SBM, a maximum likelihood procedure will be able 
to recover the node assignments consistently (Choi et al. 2012). From now on “aggregate 
SBM” will refer to this sparse model. We compare this baseline aggregate SBM with the 
multi-layer models, MLSBM and RMLSBM in terms of minimax rates (Zhang and Zhou 
2015; Gao et al. 2015) and consistency thresholds (Mossel et al. 2014; Abbe and Sandon 
2015; Hajek et al. 2014) in the next section. 

The third baseline procedure is performing community assignment through a scheme 
by which a node is assigned to a cluster if it belongs to that cluster in majority of the 
cluster assignments through MLEs in the individual layers. The cluster labels obtained from 
different single layer MLEs are aligned with each other by solving the linear sum assignment 
problem. 


5 Minimax rates and sharp thresholds 


In this section we derive the minimax rates of misclassihcation error and sharp thresholds for 
consistency of community detection in MLSBM and the aggregate SBM. For this analysis, we 
further assume that all the layers are informative of the underlying community assignments 
even though the quality of that information in terms of “signal to noise ratio” can vary, 
i.e., either all layers have more intra-connnunity edges compared to inter-community edges 
or vice-versa. Formally, for all q, l, m, or 7iqq < 7T^" j for all q, l , m. To align 

notations and settings with Zhang and Zhou (2015), we slightly modify the growth condition 
on class sizes of Theorem 4 and 5 as N q 6 [^, Vf] with s > 1 and redefine the parameter 
space of our undirected symmetric MLSBM with no self loops as 


e ML (N,K,M, a,b,/3) =•{ (z.fif'}) : N, 6 


N S N 

~sK' ~k 


,Vq,Pr > 


(m) - Ot 


M 


N 


if Zi = Zj and < 


/ \ 

Jj- if Zi # Vm 


(5.1) 


with P, z, N q , s, N, K, M as defined previously. Note that the parameters a^ and rep¬ 
resent the lowest intra-community probability and the highest inter-community probability 
for layer m respectively. As per assumption, a ^ > b^ within a layer m, however there is 
no assumption among the relationships of the parameters across layers. We define /^ as 
the Renyi divergence (Van Erven and Harremoes 2014) of order 1/2 between two Bernoulli 
distributions Bern( and Bern(^~), i.e., 


/M 


-2 log 


/ a ( m ) P m ) 
~W~N 


+ 



(5.2) 
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Let z denote the true community labels of the MLSBM and z be an estimate of it. Then we 
define the mis-clustering rate of z with respect to 5 up to permutations as 


r(z,z) = inf dn{z, S(z))/N, 

5 


where £(•) is a permutation of the community labels and dn(-) is the Hamming distance. 
Then we have the following result for MLSBM (proved in the Appendix). 


Theorem 6. 


Under the assumption that 


N y j(m) 

K log K 


—> oo , then 


inf sup E[r(z, z)\ 

Z 0ML 


exp(-(l + e N ) 
exp(-(l + e N ) 


2 >1 


sK 


K = 2, 
K > 3, 


(5.3) 


for any s G [1, \/5/3] and some sequence €n = o(l). Moreover, if N ^ r ff I — = 0(1), then 
inf 2 sup e ml -Zf|b(. 2 (, ^)] h ^ fo^~ some constant c, x.e., at least a constant fractxon of nodes are 
mis-clustered. 


The above theorem implies that for MLSBM, minimax risk of error decays exponentially 

and if - > oo, the rate goes to 0 asymptotically, i.e., exact recovery of community 

labels is possible. Moreover from the proof of Theorem 6 in the Appendix, there exists a 

procedure which achieves this rate. On the other hand if y’. -= 0(1), then the minimax 

risk of error is lower bounded by a constant (see the part on lower bound in the proof in 
Appendix) implying that consistent recovery is not possible in such situations. 

Since the model “agg-sparse” is itself a single layer SBM and E m Pj j > E m if 

Zi = Zj and E m P rj l) < Em T ifz i/ z j, then defining I ag9 as 


jagg = _ 2 log 



fl w V 5( m ) 

m | 


N 


Er 


Urn) 


N 


y 

N 


we have the following result using Theorem 1.1 of Zhang and Zhou (2015). 
Theorem 7. If jj(ogK ^ en 


inf sup E[r(z, z)\ 

z Qagg 


exp(-(l + £w )yP), I< = 2, 

exp(-(l + £N )y^), K> 3, 


(5.4) 


(5.5) 


for any s G [1, y/5/3] and some sequence cn = o(l). In addition, if AAA — 0(1), then 
inf| sup eag g E[r(z, zj] > c for some constant c, i.e., at least a constant fraction of nodes are 
mis-clustered. 


The previous two theorems state results about the fundamental properties of the two 
models which allow us to compare the models without going into the specifics of the method 
used to compute the class assignments in practice. 

Since the Renyi divergence > o for all m, we have Em I^ ^ for all m. Hence 

the minimax rate for MLSBM is lower than all individual single layer SBMs. Moreover, since 
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Renyi divergence is convex, we have jj — Ti^" asymptotically. This can be shown 

using Jensen’s inequality with the concave functions log(x) and \fx = \j (see Theorem 
11 of Van Erven and Harremoes (2014) for a proof), and then noting that asymptotically 

/ \ (m) i.(m)\2 , 

^ m x “— a (m)jv (Zhang and Zhou 2015). Hence the minimax rate of MLSBM is at most 

that of the aggregate graph. Note that equality in the above inequality is achieved if and 
only if all the I^s are equal and l ^- ) is equal for all m. We recognize the quantities 
and J*™) as signal to noise ratios in the mth layer. Hence the MLSBM has lower minimax 
rate compared to the aggregate SBM as long as the signal quality in different layers varies. 

This result will be intuitively apparent if we note from the proof of the above theorems 
that, given the parameters are known or accurately estimated, the penalized maximum 
likelihood (ML) decision rule, which attains the minimax rate of error in MLSBM, weights 
the edges from different layers by c*™' before adding. The penalty terms also get weighted 
by before being added. The quantity c ^ = log ) can thought of as a 

measure of the signal to noise ratio. Hence, layers with high signal to noise ratio, i.e., high 
quality information for the purpose of community detection, get more weight. In contrast, 
the penalized ML decision rule in aggregate graph SBM by construction adds layers without 
weighting. Hence intuitively the result on minimax rates makes sense, since if all layers 
contain the same amount of information, then it is immaterial if the decision rule weights 
the graphs by information content or not, but in all other cases giving more weight to the 
more informative layer pays off. 

Moreover, while it is clear that MLSBM has lower minimax rate than individual layer 
SBMs, it is not true trivially for the aggregate graph. Since R m ) can be written in terms of 
signal to noise ratio as I (m) x -— (ro) ’ , consequently for I a " to be large, the sum of the 
probabilities Yhm an d mus t be well separated. This is not always guaranteed 

as large a^’s and b^’s with relatively low difference can overshadow a large difference in 
smaller a^’s and while adding. We will take this point up again in the next section 

where we discuss sharp thresholds for consistency. 

We note that the model RMLSBM is a MLSBM with a restricted parameter space n^’. 
Hence Theorem 6 will give the minimax rate under the restricted parameter space with the 


divergence in the mth layer being I^- m > 


4 m) N 


, where 0 is the transformation of the 


parameters in RMLSBM as defined before. In particular, we have logit {(j)}" 1 ' 1 ) = a + f3 m . 
The rate for the aggregate SBM under RMLSBM can similarly be obtained using Theorem 

i'V' V' J (?Tl) \ 2 

7 with I a99 being I a " x ^ b —This implies that (a) if RMLSBM is the true 


data generating model then it has lower minimax rate compared to each of the individual 
layers, and (b) by the earlier discussion it also has lower minimax rate compared to the 

/ x # Am) 

aggregate SBM constructed from a RMLSBM graph, since neither nor the ratio = 

1 + SS) is equal for a11 m - 


tr 
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5.1 Sharp consistency thresholds 


We derive sharp thresholds for strong and weak consistency for community detection (Mossel 
et al. 2014; Abbe and Sandon 2015) in MLSBM and the aggregate SBM under two scenarios: 
sparse graph with average degree per layer o(logn) and ultra-sparse graph with average 
degree per layer o(l). 

In the first case, let a^ = a^logN and b^ = a^logN with > 0 for 

all m. Then Corollary 4.1 of Zhang and Zhou (2015) gives that assuming K = N°^\ the 
sharp threshold for the existence of a strongly consistent estimator for the mth layer SBM 


\l°‘ { T ) -\[ c 


,( m ) 


> 1. Hence for the aggregate SBM this threshold is 


/v 


(m) 


-vXL 


(m) 


> 1. 


IS ^ / X. ncii^ 1U1 unx a 56 i C 6 au C xmo uincoiiuiu 10 

Clearly, if the threshold is met in each of the layers, then it will be met in the aggregate SBM 
as well. However in a more realistic case where this threshold is not met in all the layers, 
whether the aggregate SBM will have a strongly consistent estimator or not will depend on 
whether the sum of probabilities meets the threshold of well separation or not, which in 
turn will depend on the relatively denser layers. To see this, note that this threshold can be 


written as 


y' a ( rn )_y' a ( m ) /—t 

,m i _ ^rn 2 > sj For aggregate graph, the denominator of this quantity 

/v-' (m) . (m) -L / 

\JT,m a 1 +VE m «2 


is dominated by the dense layers, and hence the difference in a and b must be large in dense 
layers for the aggregate to be consistent. In other words, strong signals in sparse layers will 
get ignored if the signal in dense layers are not strong. 

On the other hand, for MLSBM, strong consistency is achieved if any of JV7 ‘ m) —» oo 


or their sum goes to infinity. This implies that the threshold is ^ ai — > 1, 
which is achieved if at least one of the layers achieves consistency threshold or the layers 
together achieve the threshold. By the argument before, this threshold consists of sum of 
normalized signal to noise ratios, hence all layers, dense or sparse, get equal weightage in 
determining the threshold. The consistency threshold for RMLSBM using Theorem 6 is 




Vk 


(m) 

— > 1, where log IV and q log N with > 0 


for all m. The corresponding threshold for aggregate SBM generated from a RMLSBM 

T a (rn) -T a ( - m) , - 

is - 1 f . > v K- Here we note that the threshold for RMLSBM is also the sum 




,( m ) 


of normalized signal to noise ratios. However since the parameter space is restricted, the 
difference between inter and intra community parameters are uniform across layers, and 
variations in the above sum only come from the normalizing factor due to the layer specific 
sparsity parameter. 

Qualitatively, the minimax rate and consequently the threshold in MLSBM take into ac¬ 
count variations in both signal quality and sparsity while adding contributions from different 
layers. RMLSBM tries to estimate the signal to noise ratio in each layer by two parameters, 
one global parameter which signifies the aggregate signal quality, and the other layer specific 
parameter which signifies sparsity. Hence although RMLSBM ignores the variation in signal 
quality, it attempts to reduce the undue influence of dense layers by taking into account the 
variation in sparsity. The aggregate SBM, on the other hand, does not take into account 
either the signal quality or the sparsity, and hence is heavily influence by dense layers ir¬ 
respective of signal quality. Hence both RMLSBM and aggregate SBM would perform well 
if all the layers have similar signal strength and similar density. If the layers do not have 
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similar density but the signal strength across layers can somewhat be well approximated by 
an average signal strength, RMLSBM will still be able to detect it through the noise and 
perform well. Clearly, RMLSBM and aggregate graph will not perform well if both signal 
strength and sparsity of layers vary widely, and we need to resort to MLSBM in such cases. 

In the bounded degree case, while consistent recovery is not possible in each of the layers 
since the graph is not fully connected (only detection is possible), a consistent recovery is 
still possible in the multi-layer models. The condition for consistent recovery in MLSBM 
with a = o(l) and b( m ' ) = o(l) is Y' ^ jy 0 ^ e (q ia t the condition for 

detection or weak recovery defined as finding a partition correlated with the true community 
structure for two communities is A = > 2 (Mossel et al. 2012, 2013). 

6 Estimation using mixture model approach 

Simultaneous maximum likelihood estimation of parameters and class assignments in the 
stochastic blockmodel is a difficult problem (Nowicki and Snijders 2001; Choi et al. 2012; 
Rohe et al. 2012). The same difficulties remain in the MLSBM and its restricted version. 
Consequently, to obtain an estimation algorithm here, we view the MLSBM as a mixture 
model with discrete latent variables Z. In this case, Zi is a missing random variable that 
follows a multinomial distribution with K parameters: Zi ~ Mult( 1, a = (op, « 2 , ..., olk))- 
We follow the framework laid out by Daudin et al. (2008) to simultaneously estimate the 
conditional blockmodel parameters and the class assignments with variational EM technique. 
The derivations for MLSBM are straightforward extensions of the corresponding formula in 
Daudin et al. (2008) and are omitted in this paper while the update rules for RMLSBM 
have been derived in the Appendix The update steps for MLSBM and RMLSBM are also 
provided in the Appendix under Algorithm 1 and Algorithm 2 respectively. 

7 Simulation results 

In this section we numerically test the asymptotic results and compare the performance of 
the methods through a simulation study. We generate data from the more general model, 
MLSBM. We then compare the relative performance of the two multi-layer methods (MLE 
and RMLE) between themselves as well as with single layer methods and baseline methods 
such as majority voting and MLE in aggregate SBM. The comparison is done under various 
settings on the number of nodes N, the number of communities K, the number of types of 
relations M, and the expected total number of edges L. 

Since the true class labels of the nodes are known in simulated data, we compare the 
class assignments from different methods with the true labels. We use correct clustering 
rate (CCR) and normalized mutual information (NMI) as measures of similarity between 
partitions. The CCR counts the fraction of nodes whose cluster assignment matches the true 
class label (as determined by the true class label of the majority of nodes in that cluster). 
The higher the CCR, the better the performance of the clustering method. The NMI is 
an information theoretic measure of the mutual dependence or similarity of two random 
variables. The NMI takes values in the range of 0 to 1, with 0 indicating random cluster 
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Figure 2: Comparison of the performance of various methods for three simulation settings under 
two scenarios: all layers are sparse and have strong SNR (left column: (a)(c)(e)), and the layers 
are mixed in terms of sparsity and SNR (right column: (b)(d)(f)). (a, b) fixed K = 10 and M = 5 
while N increases from 100 to 600; (c, d) fixed N = 400 and M = 5 while K increases from 6 to 
22; (e, f) fixed N = 300 and K = 15 while M increases from 3 to 12. The legend in Figure (b) is 
common to all figures. SBMJbest indicates the result from the best performing MLE in the single 
layer SBMs. 
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assignment with respect to the true class labels, and 1 indicating perfect match between 
the true and assigned clusters. If NMI is 0, it means even though the cluster assignment 
was not completely random and done according to some algorithm, the solution presents 
no information regarding the true class labels. Since the results in terms of CCR are very 
similar to that of NMI, we omit those results here to save space. 

In all the simulation studies we repeat the experiments 50 times and take the average of 
our measures across them. We first generate the node labels independently from a multi¬ 
nomial distribution with probabilities P{Zi = k) = oik■ Then we generate the data using 
the node labels and M different connectivity matrices, all of which give larger probability 
to connections within groups in comparison to the connections between groups. However, 
we vary the “signal to noise ratio” (SNR) from layer to layer by varying the ratio of the 
diagonal and off diagonal elements of the parameter matrix. 

We consider two scenarios: (i) all layers are sparse and have strong SNR, (ii) the layers 
are mixed in terms of sparsity and signal strength in the following way: two layers are sparse 
and have strong signal, two layers are dense and have weak signal, and one layer is dense 
with strong signal. While the first scenario is a rather idealistic scenario where all layers are 
“similar” in the sense that they are sparse and strongly informative about the underlying 
community structure, the second scenario (also considered in Papalexakis et al. (2013)) is 
more realistic in applications. For the first scenario, the SNR is kept at 3-4 and sparsity is 
varied slightly from layer to layer in such a way that variational EM algorithm for community 
detection on each of the layer individually gives very similar performance. The connectivity 
matrix parameters are then sampled from a uniform distribution within a small range so 
as to maintain SNR requirement while having different values for each of the entries of the 
matrix. For the second scenario, the informative strong signal layers have a SNR of 3 while 
the non-informative weak signal layers have a SNR only marginally greater than 1. We again 
sample the actual values of the parameters from a uniform distribution within a small range. 

The initial guess for the variational algorithm in both MLE and RMLE is obtained by a 
two step procedure. On a randomly selected layer we first run spectral clustering to generate 
an initial guess and then we use this to run a variational EM algorithm on that layer. We use 
the class assignment and fitted SBM parameters from that layer as our initial guess for the 
MLSBM parameters. In our simulation results described below, the final solution of class 
assignments for both the MLE and the RMLE mostly turns out to be an improved estimate 
of the true class assignments irrespective of which layer we choose to initialize the method. 

7.1 Fixed K and M while N increases 

In this simulation, we take M — 5 types of edges or network layers, each with a separate 
connectivity matrix inducing a different network according to the schemes described above. 
We keep the number of communities K fixed at 10 and vary the number of nodes N from 
100 to 600. The aim of this study is to compare the two multi-layer methods with the single 
layer methods and baseline methods in terms of the number of nodes required to achieve a 
consistent estimation of community assignment with moderately low number of communities. 
Figures 2(a) and (b) display the results from this study for the two scenarios respectively. 
Clearly the MLE in MLSBM and RMLSBM reach NMI of close to 1 faster than the single 
layer ones as well as majority voting as the number of nodes increases. The algorithm in 
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aggregate layer performs similarly to that in MLSBM and RMLSBM for the first (all strong 
signal) scenario (Figure 2(a)), however it performs poorly for the second (mixed signals) 
scenario (Figure 2(b)). This shows that aggregating edges across layers works fine if the 
information quality is similar across layers, but it is not robust if the information content 
changes across layers. The accuracy of majority voting behaves similarly to the single layer 
ones. Moreover, for a small number of nodes, the MLE in RMLSBM performs better than 
all the other methods considered in both scenarios. 

7.2 Fixed N and M while K increases 

In this simulation, we test the performance of the multi-layer methods against the single 
layer and baseline methods with increasing number of communities. We fix the number of 
nodes N and the number of layers M at 400 and 5 respectively, while we let K increase from 
6 to 22 in steps of 4. The results from this simulation study are displayed in Figures 2(c) and 
(d). Whereas the accuracy of community detection in all the single layer methods and the 
majority voting decreases rapidly with increasing number of communities, the multi-layer 
methods explored here, especially the RMLSBM, perform well even with a large number of 
communities. Between RMLSBM and MLSBM, RMLSBM clearly outperforms MLSBM as 
the number of communities grows. This simulation also serves as a test of robustness of 
RMLSBM for small number of communities. We notice that in both scenarios, RMLSBM 
behaves similarly to MLSBM and does not break down for small number of communities. In 
the all-strong scenario, the MLE in aggregate SBM outperforms both MLSBM and RMLSBM 
for small communities, but similar to MLSBM, its accuracy also quickly drops as K increases 
(Figure 2(c)). In the mixed signal scenario, the MLE in aggregate SBM performs much 
worse compared not only to MLSBM and RMLSBM, but also to majority voting and the 
best performing MLE among the individual layers. To put things into perspective, for the 
all-strong scenario, while the NM1 for MLSBM, aggregate SBM, majority voting and the 
single layer SBMs reduce below 0.5, it settles to a value close to 0.8 for RMLSBM as the 
number of communities increases to 20. 

7.3 Fixed N and K while M increases 

In this simulation, we keep the number of nodes N and the number of communities K fixed 
at 300 and 15 respectively, while we increase the number of layers M gradually from 3 to 
12. For this simulation, each layer of the multi-layer network was generated from a Jl-class 
SBM with a simple connectivity matrix given by Pkxk — A Ik + el kxk ~ cIk- I 11 the first 
scenario, the parameters are e = 0.10 + U(—0 .02, 0.02) and A = 3e, while in the second 
scenario, the parameters are e = 0.09 + U(— 0.03, 0.03) and A = L/(1.5, 3)e. Here U(a,b ) is 
a random number generated from the uniform distribution between a and b. Note that in 
the first scenario, all layers are sparse and have strong signals, while in the second scenario, 
we let both sparsity and signal strength vary across the layers. This second scenario would 
be a good test of the robustness of different multi-layer methods. 

We compare the performance of MLE in MLSBM and RMLSBM with majority voting and 
aggregate SBM in terms of the accuracy of community detection in Figures 2(e) and (f). The 
curves for majority votes in both figures remain almost flat with increasing number of layers, 
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indicating that the accuracy of community detection does not improve with more layers. The 
MLE of aggregate SBM performs well initially, but its accuracy quickly falls with increasing 
number of layers as the model assumption that > 1 with vanishing probability 

breaks down. For MLSBM, the accuracy increases initially, however the improvement quickly 
slows down and both the curves in Figures 2(e) and (f) flatten with increasing layers. This 
is because the number of parameters to be estimated also keeps on increasing fast with 
increasing number of layers, which contributes to less efficiency. For RMLE, the accuracy 
of community detection generally increases with increasing number of layers and is almost 
always higher than all other methods. 

The three studies clearly point out the advantages of the multi-layer methods over the 
single layer ones and the baseline ones, as well as the relative advantage of RMLSBM over 
MLSBM within the scope of the simulations. 

8 Twitter UK politics dataset 

In this section we test our methods on a real dataset on interactions between British Members 
of Parliament (MPs) in the social networking site Twitter curated by Greene and Cunning¬ 
ham (2013). Although the original dataset consists of 419 nodes, we only considered the 
largest subset that is connected across all layers for our analysis. Hence our multi-layer 
network consists of 381 nodes. The different layers of network we have correspond to three 
direct relations: “mentions”, “follows” and “retweets”, and three derived relations: “men¬ 
tioned by the same person (co-mentions)”, “followed by the same person (co-follows)”, and 
“retweeted by the same person (co-retweets)”. All relations are assumed to be binary by 
assigning one if the relation is true for at least one case (e.g., if at least one person follows 
both MP i and MP j, then the relation “co-follows” between the two MPs is true). All the 
relations individually can be represented as graphs. For the graphs with direct relations, 
“mentions”, “follows” and “retweets”, a directed edge from node i to node j implies that 
MP i mentioned, followed or retweeted respectively MP j at least once in his/her tweets. 
We converted all directed edges into undirected edges for this analysis. Average degrees of 
nodes in different network layers are presented in Table 1. Note that among the direct lay¬ 
ers, “follows” is relatively dense compared to “mentions” and “retweets”, while the derived 
networks are overall much denser compared to the direct ones. 

Table 1: Average degrees of nodes in different network layers for Twitter UK politics data 


Mentions 

Follows 

Retweets 

Co-Mentions 

Co-Follows 

Co-Retweets 

58.48 

98.34 

31.88 

361.51 

297.21 

147.56 


The goal here is to cluster the MPs into communities based on the information about 
their twitter activities. The ground truth communities are known to be consisting of five 
communities corresponding to the political affiliations of the MPs: 152 Conservative, 178 
Labour, 39 Liberal Democrat, 5 SNP and 7 Other MPs. The clustering quality is assessed 
through NMI and CCR as before. 

Part (a) of Table 2 reports the performance of the algorithm for the six individual layers 
considered. Note that the performance of the derived networks is worse compared to the 
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Table 2: The NMI and CCR for Twitter UK politics data 


Measure 

Mentions 

Follows 

Retweets 

Co-Mentions 

Co-Follows 

Co-Retweets 

NMI 

CCR 

0.4522 

0.8182 

0.5992 

0.9022 

0.4610 

0.7926 

0.3449 

0.7565 

0.2520 

0.7053 

0.4009 

0.8136 


(a) Individual network layers 



NMI 

CCR 

Direct 

All 

Majority 

0.5213 

0.3825 

Aggregate SBM 
0.5819 

0.3326 

MLSBM 

0.6764 

0.5428 

RMLSBM 

0.6821 

0.6250 

Majority 

0.8477 

0.7217 

Aggregate SBM 
0.8871 

0.7506 

MLSBM 

0.9527 

0.8393 

RMLSBM 

0.9553 

0.9107 


(b) Combined network layers 


direct ones despite being denser. Clearly the signal in favor of the ground truth is stronger 
in the “direct networks” compared to the “derived networks”. The performance of majority 
vote, MLEs in aggregate SBM, MLSBM and RMLSBM on multi-layer networks constructed 
from the three direct layers and all layers together are given in part (b) of Table 2. In both 
cases the multi-layer methods outperform the baseline methods, and between the two multi¬ 
layer methods, RMLE outperforms MLE. From the results for direct networks, we note that 
the performance of multi-layer methods is not affected by inclusion of relatively sparse layers 
(“mentions”, “retweets”) and multi-layer methods perform better than the densest layer 
(“follows”), as long as all the signal strength is high. However the performance deteriorates 
as the signal quality becomes bad with the inclusion of poor performing derived networks. 
RMLSBM is more robust towards such layers with poor signal compared to MLSBM. The 
MLE in aggregate SBM performs poorly in the full network due to the number of layers in 
that network being too large. 

9 Discussions 

In this paper we extended the stochastic block model to the multi-layer settings with two 
related models, MLSBM and its restricted version RMLSBM. The community assignments 
through maximum likelihood estimation in both models are consistent under data generated 
from the more general model MLSBM with suitable conditions on the growth rate of the 
number of communities, the number of types of layers, and the total number of edges of 
the multi-layer graph. We also derived minimax rates of error and sharp thresholds for 
consistency of community detection in MLSBM, RMLSBM and a baseline model, the SBM 
obtained by aggregating the layers. We compared the proposed methods with the MLEs 
in single layer networks as well as two baseline methods, MLE in the aggregate SBM and 
majority voting, through results on asymptotic consistency and simulation. 

We demonstrate advantages of the MLE in RMLSBM over the MLEs from single-layer 
SBMs as well as the majority voting and the MLE in MLSBM, both in the asymptotic 
consistency analysis and the simulation studies, when either the number of communities is 


26 











large or the graph layers are relatively sparse. This includes the case when the individual 
layers have bounded average degree, which is an extremely challenging case for single layer 
networks. We would like to emphasize that handling the bounded degree case would not be 
possible with the usual MLSBM extension. Both the baseline methods suffer from deficiencies 
that limit their abilities to detect communities in multi-layer networks effectively. While the 
aggregation of graphs performs poorly if the community structure information contained in 
different layers are heterogeneous, the majority voting fails to infer community structure 
correctly from a large number of layers with week signals. The observations of this paper are 
in line with previous work in regression settings where a parsimonious model with similar 
accuracy is preferred over a model with a large number of parameters. The RMLSBM 
approximates the MLSBM quite well with fewer parameters for most multi-layer networks 
which are sparse or have a large number of communities. Hence in such cases the RMLSBM 
outperforms the MLSBM. 

APPENDIX A 

Derivation of variational inference for RMLSBM 

We derive the update rules for RMLSBM. Note that for the restricted model, the complete 
data log likelihood is given by 

l(A,Z) = l(A\Z) + l(Z) 

= Z n a q + 2 + An) 

i q i^=j q,l m 

- log(l + exp(7T^ + j3 m )}. 

The likelihood of the observed data can be obtained by summing the complete data likelihood 
over all possible values of the unobserved missing class assignment labels Z. However, note 
that the number of all possible assignments grows exponentially as K N , and the sum quickly 
becomes computationally intractable even for moderate N. Hence instead we use the EM 
algorithm for mixture models, where the unobserved class assignments are treated as missing 
values. However one needs to compute the conditional distribution of the missing values 
(class assignments here) given the observed data, i.e., P(Z\A). Unfortunately, as argued by 
Dauclin et al. (2008), P(Z\A) is itself intractable, since the probability of the latent class 
assignments of a node depends not only on the observed edges connected to that node, but 
also on the connectivity pattern of the whole network. 

The variational approximation concentrates the search for optimal class assignments to 
a smaller set by assuming that the class assignments follow a multinomial distribution with 
parameters known as variational parameters. It aims at maximizing an expression containing 
the log likelihood and the negative of the Kullback-Liebler (KL) divergence between the 
true probability distribution of P(Z\A) and its variational approximation Ra{-)- If the 
approximation to the distribution coincides with the distribution, then the KL divergence is 
zero and the variational approximation is the same as the regular EM. So the new objective 
function to be optimized as a lower bound of 1(A) is 

J(Ra) = log 1(A) - KL[R a (•), P(-\A)]. 
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Here we constraint Ra to have the following form of the product of multinomial densities 

i q 

The variational distribution Ra{Z) has the interpretation of being an approximation of 
P(Z\A). 


Algorithm 1: Variational EM algorithm for MLSBM 


while either convergence criterion on parameters not met or t < t max do 
// E-step: Compute variational estimates r = {r^} 
while either convergence criteria on r are not met or s < s max do 
for i 4 — {1, 2, ..., N} do 
for q 4— {1,2,..., K} do 


T, 


(*+l) 




iq 


= exp [{ A 

i<j l m 
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' qlm 
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+ ( i - 4 m) )( 1 
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end 


end 


end 
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// M-step: Estimate the parameters 

for q <(— 1 to Jl do 

_ Rt+i) 

V Z^' iq 
i =1 

for m ■<— 1 to M do 
for l <(— 1 to K do 

yv -(*+1) -(t+1) 4( m ) 

^ ‘ia 1 jl ^ij 


7T 


h+i) 

qlm 


i<j 


E 

i<j 


(t+1 ) (t+D 

T n 


end 


end 
end 

t = t+ 1 


end 


In the E step of the following variational EM algorithm, we compute the variational 
approximation estimates of the probabilities of class assignments for each node. Given the 
model parameters a, it, (5, the variational parameters r can be computed by minimizing the 
function 

J(Ra) = EE Tiq logK) + ^ EEEw^+A.) 

i q i^j q,l m 

- log(l + exp (if q; + j3 m )} ~ EE Tiq log (Tiq) 

i q 
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(9.1) 




Algorithm 2: Variational EM algorithm for RMLSBM 

while either convergence criteria on parameters are not met or t < t max do 

// E-Step: Compute variational estimates t — {t^} 
while either convergence criteria on r are not met or s < s max do 
for i <— {1, 2,..., N} do 
for q <— {1,2,..., K} do 

f£ +1) = exp[a ( g~ M 1 + ex P (*qi + /3m))}] 


s = s + 1 


i<j l m 


// Normalize the variational estimates so that they sum to 1 for 
each i 


At+i) _ -(t+i) /yS Rt- 
1 iq 1 iq / Z—/ 1 iq 

9=1 


// M-step: Estimate the parameters 
for q 1 to K do 


-h+i) _ i 

Utq — N 


= J-Yf { 

N Z-^'iq 
i =1 


// Use BFGS optimization method to find the parameters 

(7r(* +1 ), /?( t+1 )) = argnraxJ(7r, (3) 

7 r,/3 

t = t +1 


with the constraint that X ) y r *9 = 1 f° r *• The solution for the (t + l)th EM step can be 
readily obtained as 


.(m) = 

‘iq 


exp ®) M 1 + ex P(^? + $?))} • 


i<j l in 


In the M step we estimate the parameters of the model by maximizing the approximate 
likelihood. Since we do not have a closed form solution for the parameters n and /3, we use a 
gradient descent algorithm (BFGS optimization algorithm) to simultaneously optimize the 
objective function with respect to all the parameters. The gradients of the objective function 
with respect to n and (3 are 




* 7 V q,i 


:=EEW4 m) - 


i^j m 


) _ exp (Tiff + A ( n) 

1 + exp(7 Tgi + Pm) 

) _ exp(7T^ + ffl) 

1 + exp(7r^ + j3$) 
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The two algorithms corresponding to the two models are described in Algorithm 1 and 
Algorithm 2 respectively. 


Proofs of consistency results 

Proof of Equation (3.16) 
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Proofs of main results 

Before we describe the proves of Theorems 1 and 2, we need the following lemma. 

Lemma 1. For a fixed z, let 7t( 2 ) = (EE Qp e {1,...,X}, m G {1,...,M}} de¬ 
note t/ie MLA of the parameters of MLSBM, and let pE = {(7T( Z )^, P( z )m)’, Q < l, £ 
{1,..., A'}, m G (1,..., M}} fre the MLE of the parameters of RMLSBM. Then for any z, 
we have the size of the set of all possible values that 7t( 2 ) can take as 

A 1 K(K+ 1 ) 


|n w | < 1 — 


N 

K 
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and that ir/h can take as 
(z) 


L b)l ^ 


< ( M 1/2 ( + 1 


i< 2 +k 


N(N + 1) 


where II( 2 ) and 11^ denote the range of and jt R ^ respectively for a fixed z. 

Proof. We first determine the size of the set of all possible values that the MLE of the 
parameter array n can take in the MLSBM. Notice that from Equation (3.5) the estimate 
f[t m ) of the parameter matrix for any layer m can take any of the W q <i(n q i + 1) values, since its 

K(K + l)/2 upper diagonal components (7f^, q < l, q, l G {1,..., K}) can take any of the 
n q i +1 values in the set {0,1 /n q i,.. ., 1} independently. Hence, |n| = n 1(1 ( n qi + !)• However 

m q<l 

(N\ 

this is subject to the constraint that J2 n qi — ( 9 ) • This implies that |n| is a product of 

q<l \ Z J 

^ ^ positive terms whose sum is fixed. So |ff| is maximized when the terms are all 


equal, i.e., n q i = 
inequality 


K + l 
2 


uniformly across all m. Hence we have the following 


(N 2 1 

< (iV +1 


MK(K+ 1)/2 


MK(K+l)/2 


(N 1 

<U +1 


MK(K+ 1) 


Now we turn our attention to the set of values the MLE of the parameter array in 
RMLSBM can take. Note that Equations (3.13) and (3.14) together represent K(K + l)/2 + 
M equations involving partial sums of the MLEs of the K(K + l)/2 + M elements in the 
parameter array 1 t r (although the equations are written in terms of the transformation (f) for 
convenience, they actually represent the same equations as Equations (3.10) and (3.11). The 
right hand side of the equations together are the sufficient statistics under the RMLSBM. 
Note that due to the identibablility constraint, we have only K(K + l)/2 + M — 1 free 
parameters. On the other hand, one of the equations in the set of equations is also redundant, 
since adding together the first M equations represented by Equation (3.13) and adding the 
remaining K(K+ 1)/2 equations represented by Equation (3.14) yield the same equation and 
hence there is one linear dependence. This set of equations determines the MLE of ir R . Hence 
the size of the set of all distinct solutions ft R is at most the number of possible sets of system 
of equations. To determine the later, we notice that the right hand side of each of the first set 
of M equations can take N(N + l)/2 + 1 values from the set {0, 2/[IV(IV + 1)],..., 1}, while 
the right hand side of each of the next set of K(K + l)/2 equations can take Mn q i +1 values 
from the set {0,1 /(Mn q i ),..., 1}. So the size of the set of possible values the estimated 
parameter array n R can take is 

in s i<n(M»„ + i)n(^±i +1 

q<l m= 1 ' 
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The first term is maximized as before when all the n q i s are equal, i.e., n q i 
The second term is a fixed quantity. So we have 



U R \ < 


< 


< 





K(K+ 1)/2 




K(K+ 1) 


+ \ K{K+1)/ 2 ^jV(fV + 1) 
f N(N + l) | \ M 
/7V(Ar + l) \ M 



I< + 1 

2 


Lastly notice that the transformation defined by Equation (3.1) is an onto function but not 
necessarily one-to-one, so one or more parameter arrays 7 t r map to one 0. Hence for every 
estimate 0 there exists a corresponding estimate array n R . Therefore we have 


<S| < |ri R | < 




K(K+ 1) 


m/v + i) | 


0 

For brevity of notation henceforth we remove the subscript (z) from n R ^ and 0^), 
denoting the set of parameters of MLSBM, RMLSBM and the transformation of the set of 
parameters of RMLSBM respectively for a fixed z. We also remove the subscript (z) from 
LI( Z ) and rig } . 

Proof of Theorem 1 

The proof for the unrestricted case follows the structure of the proof of Theorem 1 in Choi 
et al. (2012). Following the arguments in the aforementioned paper, we first notice that for 
a fixed z, each estimate is a sum of n q i independent Bernoulli random variables with 

mean . Hence the probability that 7T^'' ) = u, where v G {0,1 /n q i,... , 1} can be bounded 
as 

p (^ l) = v)< exp (- n q iD(u || fj 0 )) , 

and by the independence of A the bound on the probability of any realization i r is 

P(n) < exp II • 

\ q<l m / 


Recall n denotes the set of values the estimate array 7r can take for a fixed class assignment 
z. In Lemma 1, we have bounded the size of this set as |n| < (^ + i^ MK ^ K+1 \ Now we 
consider the event that Y^ q <i n qi 'Thm £ ) (d^ n) || 7f^) is at least as large as some e > 0, and 
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derive an upper bound for its probability of occurrence: 


= = T, P ^ 

\ q<l rn ) * e n e 

< J2 ex p (- n q i d (Kt ] ii <r } )) < J2 ex p(- e 

* PTT« V q<l rn J Jen 


N 


= |n e | exp(-e) < |n| exp(-e) < ( - + 1 


eiq 
MK(I<+ 1) 


exp(—e) 


Hence for all e > 0, we have over all K N possible class assignments z, 


P 


(max n ql ^ D(n ( ™ } j | n ( q f) > e") < P (\J 

\ q<l m / \ Z \ q<l m 


dr 1 ii if) > * 


< K n exp ^ MK(K + 1) log + l^j — < exp ^NlogK + M(K 2 + K ) log — e 

The proof for the restricted case, although follows the same structure as before, is more 
involved as we need to deal with estimating equations instead of closed form solutions. 
Note that for a fixed z, the left hand side of each of the M estimating equations in (3.13) is 
7 V(jv+i )/2 Sg<z n *i%\ w hich is a sum of N(N + l)/2 independent Bernoulli random variables 
with mean 5 ~2 q <i n qi4^ respectively. Hence the probability that 

jvpv +^/2 J2q<i n qi^qi' ) = u ™, where u m G {0, 2/[iV(iV + 1)],..., 1} can be bounded as 


( Eq<i n q i<t > K 4 

P ( N(N + l)/2 ~ l ' m ) — exp 


U m ) 


N(N + 1) 


D z/„ 


fn )' 


E q <i n q ir ql 

N(N + l)/2 


for m G {1,..., M}. 

Similarly the left hand side of each of the K(K + l)/2 estimating equations in (3.14) 
i s if E m ^qi'* ’ which is a sum of Mn q i independent Bernoulli random variables with mean 

jrq Hence the probability that ^ E m V'T = v qh where G {0,1 /(Mn q i), ..., 1} 

can be bounded as 


P 




< exp —Mriq/D 






for q < l, q, l G {1,..., K}. 

Now since these K(K + l)/2 + M estimating equations together determine the MLE tt r 
of RMLSBM, the probability of any realization of tt r is bounded by the joint probability 
of the occurrence of the estimating equations. Note that although the equations within the 
two sets (3.13) and (3.14) are independent of each other, the two sets of equations are not 
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independent of each other. Hence because of the inequalities that P(A D B) < P(A) and 
P(A fl B) < P(B), we have 


P(tt r ) < JJP 


N(N+ l)/2 




M 


q<l 


I N(N + 1 ) ( Thq<l n ql^ql ^ 

< exp - 2^-x- D 


N(N+l)/2 N(N + l)/2 


Eq<l nqlFql 


(m)' 


(9.5) 


and 


^ infer) 

q<l \ m / 

<exp l-J^MnqiD ij^Y 1 

\ q<l \ in 


Hm) 


1 \ Am) 


M 


£■ 


V 


(9.6) 


For brevity, we call the right hand sides of Equations (9.5) and (9.6) as exp(— Ei) and 
exp(— E 2 ) respectively. From Lemma 1, we have the size of set of all possible values tt r can 
take 




+ 1 


M 


Now we consider the event that E % is at least as large as some e > 0 for i — 1, 2 respectively. 

r R\ tj ( r- ttR . 


P{ nf) = P(n K e n K ; E t > e) = Y p ( 

7r«enf 

/ at \ *(*+!) 
< \U R \ exp(-e) < fM 1/2 -^ + 1 j 


7t R ) < 


Y exp (-Ei) 

Tr^efrf 

(N(N + 1) xM 


+1 exp(-e). 


Hence for all e > 0, we have over all K N possible class assignments z, 

/ ;w iW' 


2—' P ^ jy ( ^T + l)/2 JV(JV+ l)/2 


< exp 


> e 


Mog/i + (Jl 2 + ih) log (Vfe + 1^ + Af log _ 


and 


P | max < 


< exp 


Y. Mn * D (jfY.O’*" || jiT.fi 

q<l \ m m 

N log K + (K 2 + K ) log (m 1/2 ^ + l) + M log 


— e 
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Proof of Theorem 2 


First we note that X, as defined in Equation (3.9), is a sum of bounded independent random 
variables, because each element in the sum is bounded by C = 2 log (VMN) in absolute 
value. So we can use a Bernstein type inequality for sums of bounded independent random 
variables (Chung and Lu 2006) to obtain 


/ 


P(\X — E(X)\ > e) < exp 


< exp 


see m , 

m i<j 


[f 2 1 + 


\ 


8L log 2 (VMN) + ±e\og(VMN) J 




since EE-PtEE] = EEPp m) log 2 (7r^ ) /( 1 “ tt^)) < 4L log 2 (\/MlV). Combining this 

m i<j m i<j 

inequality with the result in Theorem 1, we have over all possible K N class assignments z, 


maxP(|/(A; z) — lp(z)\ > 2 eL) 


< max I P E n q i 

\ \ q<l m 


(m) 11 — (m) 
ql 


n^)>eL}+P(\X-E(X)\>eL) 


< exp NlogK + M(K + A")log — + 1 - eL 


N 


K 


exp N log K — 


e 2 L 


8 log 2 (VMN) + |elog (VMN) ) 


), 


which goes to zero asymptotically as N grows under the growth conditions mentioned on K 
and L. So we have 

max) l(A;z) — lp(z) \ = op(L). 


Proof of Theorem 3 


The proof for the RMLSBM will be a slight modification of the earlier proof for MLSBM. 
As before we need to bound the two terms in the decomposition of the difference between 
maximized likelihood and its expected value defined in Equation (3.16). For that we write 
the first part in the right hand side of (3.16), which we call E 3 here for brevity, in terms of 
the quantities we have already bounded in Theorem 1. We begin by noticing that, since the 
Kullback-Liebler divergence D(a\\b) is convex, we can use a reverse of Jensen’s inequality 
(Simic 2009; Budimir et al. 2001) to write 


ii <r’) 


iv(A'+i) n / E,<.”Jir l 

2 ^!V(jV + 1)/2 


E q <i n q i<j>y \ 

N(N + l)/2 J 


+ log(iW7V 2 ), 
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and 




ZM 

D ql 


l(m) 


< Mn ql D [ ^ J2 


'dm) 

°qi 


-hY,*™ ) + iog(Jkfjv 2 ). 


M 


To derive the inequality, we used — log(0^/) as our convex function of on the 

interval [l/(MiV 2 ), 1 — l/(MiV 2 )] to obtain a reverse of the “log-sum inequality”. Summing 
the two inequalities over m and q, l respectively, we have 


E :i < 2 V N(N + 11 D ( ) + 0 (M(log(VMJV)) 1+i ), 

^ 2 \N{N + l)/2 N(N + 1)/2J y K V 7 


d m )' 


and 


E 3 <2j2Mn„D(dj2‘ 

q<l \ m 


Um) 

ql 


JJ E ^ ) + o(K 2 (\og('/MN)) 1+s ). 


Hence E :i is bounded by the minimum of the above two upper bounds. Since the first part 
in the right hand side of the above two inequalities is bounded by the same quantity, we will 
take the inequality for which the second part is smaller. Under the conditions on the growth 
of L in the theorem, the minimum of the two second parts is o(L). Consequently, 


maxP(\l R (A-,z) - l${z)\ > 2 eL) 

Z 

< exp (n log K + (K 2 + K) log M 112 ^ + 1^ + M log — 


eL 


+ exp N log K 


e 2 L 


8 log 2 (VMN) + felog N )' 
so under the growth conditions mentioned under different asymptotic settings, 

max\l R (A) z) — lp(z)\ = op(L). 


Proof of Theorem 4 

For MLSBM, if the conclusion max| 1(A) z) — lp(z) \ = op(L) of Theorem 2 holds, the data 

are generated according to a Jl-class blockmodel with membership vector 5 and probability 
matrix 7 f, and the maximum-likelihood /l-class blockmodel class assignment estimator is z, 
then it is easy to see 

lp(z) — lp(z ) < lp(z) — lp(z ) + l(A, z) — l(A , z) (9.7) 

— I lp(z) — KA, z)\ + | lp(z) — l(A, z )| = op(L). 

Note that the terms lp(z) — lp(z) and l(A,z ) — l(A,z ) are positive quantities as mentioned 
earlier. 

The rest of the proof requires the concepts of partition and refinement as laid out in 
Choi et al. (2012). We briefly review the concepts here and apply them to MLSBM and 
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its regularized version RMLSBM. Let [N] denote the set of integers {1, 2,..., N}. Any 
multi-layer blockmodel induces a partition of the M upper triangular probability matrices. 
Formally we define a partition of { P^ %<i into U subsets {Si,..., Su} by the following 
mapping 

[iv], je[N], i<j -t [U]. 

Note that the partitions induced on all M probability matrices are the same, since the 
partition is a function only of the indices and not of the type of edges. There exists a 
bijection between the set [U] and the upper triangular part of the parameter matrices of 
MLSBM, so we can write 7r e(i,j) — 7l z l z :i ■ 

In MLSBM, for a general partition, we define S u = {( i,j ) : 0(i,j) = u, i < j} and 
tt u = |S U | -1 ^ Pij i so that we can define the log likelihood under this partition as 

m Q{i,j)=u,i<j 


M 

m = E 4A) + (1 - dfWg (1 - 4A))t 

m=1 i<j 

It is easy to see that l* P (Q z ) = Tp(z), where @ 2 is the partition corresponding to block 
model assignment z. A refinement 0' of partition 0 further subdivides the partitions in 
0 into subgroups or sub-partitions so that 0 (A, j\)h<ji — 0 (A, ^ 0(A,ii)«<ji = 

0(A,.72)* 2 <j 2 - From Lemma A2 of Choi et al. (2012), it can be easily obtained 

l* P (G) < l* P (Q'). 


One such refinement is constructed in the following way (Choi et al. 2012). We consider 
a K class MLSBM with membership vector z and let @ 2 denote a partition of { P-z for 
any z. Now, for a given membership class under z, partition the corresponding set of nodes 
into subclasses according to the true class assignment z of each node. Then remove one 
node from each of the two largest subclasses so obtained, and group them together as a pair; 
continue this pairing process until no more than one nonempty subclass remains. If pair 
(i,j) is chosen from the above procedure, then z t = Zj and z t ^ z r Define C\ as the number 
of ( i,j ) pairs selected by the above method. Since at least one of i or j is misclustered, we 
have N e (z)/2 < C\ < N e (z). 

Next, for each C\ pairs hnd all other distinct indices k for which condition (3.26) of the 
theorem is satisfied. Let C 2 denote the total number of distinct triples that can be formed 
in this manner. For each of the C 2 such triples (■ i,j,k ), we remove P\k and Pjk from their 
previous subset assignment under @ 2 and place them in a new distinct two element subset. 
This partition so created is a refinement of the original partition 0 2 , and we call this refined 
partition 0 2 . The condition (3.26) of the theorem implies that for each pair of classes ( q,l ), 
there exists at least one class c that satisfies, 


D 



7T 


(m) (m) 


qc 


+ 7T 


Ic 


+ D 



> lk 

2 J ~ MN 2 ' 


(9.8) 


Consequently for any of the C\ pairs of nodes under the true partition, we obtain triples at 
least as large as the cardinality of the smallest class. Hence C 2 is at least as large as C\ s\ 
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where s the size of the smallest class. Now as per assumption, s = Q(N/K). Hence we can 
bound the difference in the likelihood: 




ij 


m i<j 


- o*'» (mi) 


•««*») 


N e {z) n , T , MNKL 
2 1 ’ KLMN 2 


N e (z ) 

N 


n(L). 


Since the above procedure is valid for any class assignment vector z, we can apply it 
for the maximum likelihood estimate 5 as well. Note that 5 induces partition Q z of the 
probability matrices m={i an d refinement <d' z increases the likelihood, 

i.e., l* P (Q z ) < l*p(Q' z ). Also we have l* P (Q z ) = lp{z). Consequently we have, 

lp(z) - lp(z) > lp(z) - /M©' 5 ) = ^n(L). 

Combining this with the result from Equation (3.25), we have 


N e {z) = o P (N). 


Proof of Theorem 5 

Before we proceed with the proof we need two lemmas. The first lemma bounds the difference 
between the maximized expected likelihoods from the unrestricted and the restricted models 
under the true partition. The second lemma uses this result along with the result of Theorem 
3 to bound the difference between the maximized expected likelihood for the restricted model 
under the RMLE and the maximized expected likelihood for the unrestricted model under 
the true partition. 

Lemma 2. Under the true partition z, if any of the five sets of conditions in Theorem 3 
on the growth of multi-layer blockmodel parameters holds, then lp{z) — lp{z) = o P (L), where 
L is the expected number of edges in the multi-layer graph under the corresponding set of 
conditions. 


Proof. For large N , subtracting Equation (3.24) from Equation (3.23) we have 


HiuHi 


lp(z) - lp{z) 

= X/ D (i'qi II Vqi > 

q<l m 

fMN(N + 1 ) 


<\E Q \\og(MN 2 ) + 


V 


v 


I ' Eq I ) Cl MN 2 (log M) 1+5 (log N) 2 + 6 


log 


C\L' j (MTV 2 (log M) 1+<5 (log N) 2+s ] 
1 /MN 2 


38 



C L' 

P ^ ^ (log M) 1+<s (log N) 2+s ^ 

—op(L') + op(L') log ^ ClL 

—o P (L') + o P {L')R 
=op(L ), 


C\ L' 


(logM) 1+<5 (log N) 2+s 


(log M) 1+l5 (logiV) 2+<5 i 

^ [(log M) 1+<5 (log iV) 1+<5 ] 


where C\ is a constant and R = log ( (logM) iS^ gjV) 2 +a ) j [(log M log N) 1+s ], The inequality 
in step 2 comes from the upper bound on D(p\\q) which can be derived as follows. Without 
loss of generality, we can assume that p> q and D(p\\q ) < plog 2 < p max log ■ Next we 
replace p max and g min by the assumption on the lower and upper bounds of the restricted 
block model probabilities given in Equation (3.3). 

Now to complete the proof, we only need to verify that under the Eve sets of conditions in 
Theorem 3, the term R in the right hand side of the above derivation is o(l). Under the first 
two sets of conditions, L' = MN (log N) 3+s and consequently R = log ( A ^°g WUeW + ) = 

o(l). Under the third set of conditions, L' = iV(logiV) 3+<s and hence R = —- = 

o(l). Finally under the last two sets of conditions, if L' = MN(log N) 1+s then R = 
= “I 1 )- md if L ' = V(logM) 2 «(logiV)'+« then R = iggiggp = o(l). 

□ 

Lemma 3. Under the true partition z and the RMLE of the partition z R (i.ethe MLE in 
the restricted model RMLSBM), we have lp(z) —lp(z R ) = op(L) whenever the conclusion of 
Theorem 3 holds. 

Proof. Note that lp(z R ) > l R (z R ) since the maximum of the unrestricted likelihood lp(z) is 
uniformly larger than or equal to the maximum of the restricted likelihood l R (z) for all z. 
Moreover, z maximizes Ip(-) and hence lp{z) — l R (z R ) > 0. Notice that l R (A, z R ) — l R (A, z) 
is positive since the observed restricted likelihood is maximized at z R . So we have 

l P (z) - l R (z R ) < l P (z) - l R (z R ) + l R (A , z R ) - l R {A , z) 

< \l P (z)-l R (A,z)\ + \l R (z R )-l R (A,z R )\ 

< | ip(z) - J$(z )I + I J$(z) - l R (A,z )I + I l R (z R ) - l R (A;z R )\ 

= o P (L), 

by Lemma 2 and Theorem 3. 

□ 

Now we are ready to show that the class membership assignment vector estimated through 
the maximum likelihood estimation in the restricted model RMLSBM is consistent under 
data generated from the MLSBM. We define regularized partition © R of the matrices of 
probabilities between nodes P 1 -™' > , computed according to the restricted model RMLSBM and 
its refinement Q R in exactly the same way. We further define the corresponding restricted 
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log likelihood associated with this partition Q R as Zp R (0 R ). For convenience we again resort 
to the transformation defined by Equation (3.1) 


M 


me R )=e +1 1 - dr 1 ) mi - 1< ’” ) 


& R (i,j) 


)}■ 


m=l i<j 


For any membership assignment z R from the RMLSBM, let l* P R (0 R t ) be the corresponding 
partition of Pjj"' 1 ■ It follows from this dehnition that l* p R (Q R H ) = l R (z R ). Hence we have 


r P (z)-r P R (e%) = J2'E D { p <i' 


m i<j 

N e {z R ) 


>( m ) II A m ) 


e'5W) 


= C 2 MQ(g) = ChMVt ( 


„, tS MN N e (z R )„, r , 
Q(L)——g = — —-Q(L). 


2 7 KLh 

Now we specialize to z R . Since Q R is a refinement of 0 R , it increases the restricted likelihood, 
i.e., l*p R ((-y R { ) > l* P R {Qf R )- Using this and the fact that l* P R ((-) R t ) = l R (z R ), we have 

i P {z) - l%(z R ) > l P (z ) - 1* p r (Q' r r ) = 


h 


The left hand side is o(L) by Lemma 3, and hence, 

N e (z R ) = o P (h). 


Proofs of minimax and threshold results 

Proof of Theorem 6 

For brevity we mention here only the results and proofs that differ from the proof contained 
in Zhang and Zhou (2015) and refer the reader to the aforementioned paper for a complete 
description of the techniques involved. We define the homogeneous/symmetric multi layer 
stochastic blockmodel as the MLSBM with the parameter space Q\ IL that has all intra-block 
connection probabilities equal to each other as well as all inter-block connection probabilities 
equal to each other for each layer. As before, we assume no relation among the connection 
probabilities of one layer with that of another layer. The parameter space can be written as 

Q f il {z, N, K, M, a, b) = 

, , h( m ) ] 

if z % = Zj and if z, t ^ Zj, Wm >. (9.9) 

Note that this model space is homogeneous and uniquely determined by z, i.e., given the 
community assignments z, the block model parameters are uniquely determined. This model 


(*, {dr 1 }) e e ML 


p( m ) _ 

ij 


2 (m) 

~w 
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space is also closed under permutations, in the sense that the model obtained through per¬ 
muting the class labels also belong to @* /L . We further define a submodel of this where the 
block sizes are all (almost) same as 


CsML 


N 


(z, N, K, M, a, b) = < (z, {P^}) E ©f L (z, N, K, M, a, b) : N q = (1 + o( 1))-, Vg 


(9.10) 

and yet another submodel space of Qq IL where the communities are of only 3 sizes, |_^J, 
|_^J — 1 and |_^J + 1. This submodel space, denoted as Qf lL is the least favorable case for 
community detection in terms of the size of communities (See Section 5.1 of Zhang and Zhou 
(2015)). The parameter space can be written as 


N,K,M, a, b, S) = 


( 2 , {P‘”>}) € 0 o M1 (z, N, K, M, a, b) : 


«» = 


N 

K 


= Sx, 


q:N q = 

N 

K 

+ 1 

= s 2 , 

q:N q = 

N 

_K 

- 1 


S 3 , 


Si + s 2 + s% 



(9.11) 


The submodel spaces ©q /l and Q^ IL are also homogeneous and closed under permutation. 
Let 5 be the class assignment obtained from some procedure under consideration. We break 
the proof up into two parts, the first one proves a lower bound for the minimax risk and the 
second one shows that there exists an algorithm which attains the lower bound. 


Lower bound 


It was argued in Section 5.1 of Zhang and Zhou (2015) that Of 11 ' is the least favorable 
subspace of Q AIL using the property of being closed under permutation. Hence, a lower 
bound on the minimax rates established on Qf IL will also be a good lower bound for the 
larger parameter space © ML . Since the supremum over a larger space is always greater 
than the supremum over any of its subspaces, the lower bound on ©} /L is a lower bound 
for the larger space trivially, but being a least favorable subspace makes it match the rate. 
Throughout this section (proof of lower bound) we assume K > 3. The proof for the case 
K = 2 follows from Zhang and Zhou (2015) with the same modifications described below for 
the K > 3 case. 

We start with a couple of lemmas. The next lemma due to Zhang and Zhou (2015) 
shows that for any homogeneous parameter space which is closed under permutation (e.g., 
@f /L and all its submodels defined above), the minimum global Bayesian risk of z under 
the uniform prior is the same as the minimum of the local Bayesian risk for the first node. 
The local Bayesian risk for one node needs to be computed under an appropriate local loss 
function. Zhang and Zhou (2015) defined such a local loss function as the average over all 
possible permutations of z that minimizes the distance from the true class assignment. Let 
S z {z) = {z — S(z) : dn(z, z ) = inf ,5 dn{z, 5(5))}. Then the local loss function is defined as 


r(zi,Zi) 


1 

WM 


d H (zi,z$. 

z'GSz(z) 


(9.12) 
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Lemma 4. (Lemma 2.1 of Zhang and Zhou (2015)) Let A be any homogeneous parameter 
space which is closed under permutation and r be a uniform prior over the elements of A. 
Defining the global Bayesian risk as B T (z ) = jfy XLeA E[r( z ,z)] and local Bayesian risk for 
the first node (under the local loss function) as B r (zi) = XLeA E [ r i z u Z\)\, we have 

inf B t (z) = inf B r {z-f). 


Now we have the following lemma on the Bayesian local risk for the first node in the 
parameter space Qf IL under an uniform prior. 


Lemma 5. Let z be an estimated class assignment from some procedure in the block model 
defined by (9.11). Let r be a uniform prior over all elements in Q)f L . For the first node, the 
local Bayesian risk, B T (z\) = jqml^ Yl z ee^ L E i r ( z h 5i)] is lower bounded as 


B t (zi) > eP 


( 

y c w 


L#J 

E X F 


2—1 



UJ \ 

E h m) 


2=1 


/ 


(9.13) 


where e > 0 is a constant, c^ = log (- ———A-f ), and X ~ Bern(^~) and ~ 

& ^H(r J 1 K N ’ 1 

Bern{ q ^~) are independent random variables for alii = { 1 ,..., LfJ}. Moreover if N —> 

oo, then the right hand side of Equation (9.13) is greater than or equal to 


exp(-(l + o(l))iV^/( m 7/l), 

m 


while if Jv ^ y( — = 0(1), then the right hand side of Equation (9.13) is 0(1). 

Proof. We follow the proof of Lemma 5.1 in Section 6.2 of Zhang and Zhou (2015). Define 
as a subset of the parameter space of Qff L such that the class to which the first node 
belongs to is always of size LfJ +1, i.e., @^ L = {(z,P^) G Qff L : N Z] = LfJ +1}. Letting 
x 2 = (LfJ + l)^ it was shown in Section 6.2 of Zhang and Zhou (2015) that the ratio of 
the cardinality of the set to that of Qf IL is a constant, i.e., |©Jq L |/| ( 9;L fi | = % 2 /N > e 
for some e > 0. Consequently, 

1 L 1 z&e^L I Ai I z( z Q ml 

For each z' G ©Jff we define k'(z') = z[ as the class to which the first node belongs to. 
Let k[z !) be the set of indices of the communities of size [f J. Since the first community is 
of size LfJ + 1, k'(z') does not belong to k(z'). Now we define a new assignment z(z') based 
on z' as follows 


z(z') 1 = 


min{fc G k(z’) : k > k'(z')} 
min k(z') 


if ma xk(z') > k'(z') 
if ma xk(z') < k'(z'), 


(9.14) 


42 










and z(z')i = z\ for all i > 2. Clearly z(z') G ©)J L , differs from z! only in the first node 
and by definition has a distance 1 from it. Moreover for any two distinct class assignments 
z', z" G ©ij L , z' 7 ^ z", the new assignments based on them z(z') and z{z") are also different 
(Zhang and Zhou 2015). This implies that 0^ L = {z(z') : z' G @^ L }- Consequently, 


B r {Zi) > o I r\M L I J2 (^K^l)] + E l r (z(z')l,Zl)})- 
z \V Ll I z , e@AIL 

Next we will derive a lower bound for the Bayes risk, inf^ D r {z\). Conditional on z! or 
z(z '), the distribution of A in MLSBM involves a collection of M adjacency matrices. We 
define two sets J 0 and J\ as follows 


J 0 = {i 6 {l,...,iV}\{l}:z' = z;}, 

Ji = {i€ {1,... ,iV}\{l} : z' = z(z')i}. 


Hence, 

p(aw )= n {n 

m l ieJ 0 

and 


p { aw ))=n n 

m \ ieJj 


, . 4< m ) 

j(m) \ A u 


N 


, , 14 O) 

,(m) \ 1 A li 


1 - 


N 


n 

ieJi 


, . 4( m ) 


N 


1 - 




iV 


■/(7f c ), 

(9.15) 


, . 4( m ) 

. (m) \ "li 


AT 


, , i 4( m ) 

( (m) \ 1 A li 


1 - 


N 


n 

is Jo 


lv“ 




1 - 


"aT 


i-47° 


■/(kl c ), 


(9.16) 

where the function f(A c ) is a function involving connections from node 1 to nodes not in 
Jo U J\ and all connections not involving node 1. Let z B attains the inhmum of the local 
Bayes risk. Since dniz', z(z')) = 1, the loss with respect to the local loss function defined in 
Equation (9.12) is r {z[,z?) = d H (z[,z?) which is a 0-1 loss. Then zf is the Bayes estimator 
with respect to the local 0-1 loss function and consequently zf would be the mode of the 
posterior distribution, i.e., 

A, if Em E iE > Em E, E J, 

if Em E iE , 0 < Em EieJi C^A\ 


z, fl = 


(9.17) 


Hence we have, 


( 


inf B t (zi) > eP 


LfJ 


E- Y ‘ m) > E cl ”’E r . 


LfJ \ 


(m) 


V 


i= 1 


i=l 


(9.18) 




To derive the probability in the above lower bound, let Z t = Yh m Yhm ' 

F/ m) ). Hence the moment generating function (MGF) of Z % is, 


M Zi {t) = I \M z(m) (t) = Y[E(e tc(m)Xi )E(e- tcy "‘ ,Yi ) 


n 


, s hi m ) 

e te(ro) b — + 1 


N 


N 


_ tc (m)y 0 
(m) 


fc (m) a 


AT 


+ 1 


j.(m) 
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The MGF, Mz i (t) is minimized at t* = \ and the minimum value is 




nW &M / a M h( m ) 

- M/(l-)(1- 

N N V V N N 


(9.19) 


This implies - log (M z .{t*)) = Denoting S N , = Ef=i E m ^ for N ' = we 

obtain for any 5 > 0, 


N' M 


p(s N , > o) > x: n n ^ ^ 

N'S>S n />0 *=1 m= 1 


*\\N' 


Ms ^ 


N' M 


e nn 




exp(7VT*<5) 1111 M ( m )(t* 

’ N'5>S n />0 i=l m=l z\ 'V 


Now denoting q m (w) = for all m, we have 

„(m) \ / 


iV' M 


P(Sjv' > o) > exp(-Ad^ /<"*)) exp(-W<$) e nn* 

m N'8>S n />0 i= 1 m=l 


m\ z i 




We note that q m {w) is a probability mass function for all m G {1,... M}. Let f G 

{1,..., IV'}, be i.i.d random variables with probability mass function q m {w). Then we have, 

P(S W , > 0) > exp(-iV' /<"•)) vq>(-NVS)P(5 > ^EEW”) S »)• < 9 - 20 ) 

m z=l m 


Clearly — Y- rr ^) can take 3 values, ±c^ and 0. The first two values corre¬ 

spond to the cases when x\ m ' > = 1, Y- m} = 0 and Y^' n) = 1, X^ n) = 0 respectively. We com¬ 
pute the first probability as q m (wj = c^b) = exp (c^/2) ^1 — /M z (»(l/2). 

The second one follows similarly. Hence we have, 


W, M = 


-,{m) 


w.p 


a(m) b(m) I 

N ~N~ 


( 1_^)(1_^))/ M (1/2) 


_ C M w . p ^ *£>(1 - ^)(1 - ^)/ Af z ( m) ( 1 / 2 ) 

0 w.p 1 - P(H} (m) = C ( m )) - P(H} (m) = -c^). 


,(m) ■ 


fo( m ) - 


Hence 


E(W t {m) ) = 0 and Var(W$ m) ) = 2 (c^) 2 J ^^(1 - ^)(1 - ^)/M„ (m) ( 1/2). 

V 

Hence denoting as H 7 ,- we have, P(yr Sili kbj) = 0. Also by independence we 

the have variance of -X Wi as V = )C m V ar ( W - m) ) / N 1 = where V ! - m ' ) = 

Var(wl m) )/N’. _ 

We now prove that / a/V —> oo. First we consider the case when a DO x b^ n \ 

Then we have A”b x A ) ^ ) ^ (Zhang and Zhou 2015). On the other hand replac- 

A77 1 AT ITS l T r(m) ^ (d m b 2 a( m ) //at/ ts\ ^ (a( m )-b( m )) 2 K ■ ( m ) ^ 

mg N by N/K we have V (m > x i— a- /(N/K) x 1 tt(m)jv2 J smce c x a (4 

and M z[ m)(t*) = exp(-/ (m) ) = 0(1). Consequently, a/H x ~ '1 Clearly 
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Zml {m) /W x ^■y/Em ( ' a(ni> a i^) m ^ x \/ A ^a- /( ~ “>► 00. Next consider the other case 

b( m ) — 0 ( a ( m )). Then Em-^ m ^ x ^"jv^ and x log(c7 m )/&( m )). Consequently, V( m ' ) x 

^(log(fS)) 2 yiS/f. Hence v / C = o(VE m a (m) A7^)- This implies = 

W (VEm a(m V^)- Since Ym a{m) / K X N Ym^ m) / K °°> We ll£LVe E m 1 ^ / ^ 00 • 

Then we choose 5 = (E m -^”^\/Em Hl” 1 )) 1 / 2 so that 5 = o(E m -^ m ^) and a/H = 
\/(E m H© 1 )) = o(<5). Since the ratio of h to the square root of variance goes to infinity as N 

goes to infinity by the central limit theorem we have, P{5 > ^ E*Ii Em > 0) —* 1/2. 
Consequently from Equation (9.20), 


P(Sn> > 0) > exp(—(1 + o(l))N' ^2 


LfJ 


=»p 


Y" Z m ) Y" Y™ > 


/ I 


2=1 


E 1 

m 


,(™) 


m 

LfJ 


\ 


2=1 


>exp(-(l + o(l)) 


JV EJ W 


Ji 


), 


7 


provided N ^ 00 . The last inequality is obtained by replacing iV 7 by [NJ. 

If however, NY m l(m) / K = ^L 1 ), we can choose a 5 so that NS/K is also a constant. 
Then considering the cases a x 6©d and = o(Z m 7 separately, from the earlier 
argument we have E m ^ m V"'7V F x \/iV E m 1^/K = 0(1) in both cases . So we have, 
g (m) = 0(1). Hence all the terms in the right hand side of Equation 

-' v 2-^m 1 


5 

Vv 


I< 


nVv 


(9.20) are 0(1) and consequently, P(SV > 0) is 0(1). 


□ 


Now we combine the results of these two lemmas to prove a lower bound on ©q 


P]ML 


Lemma 6. Under the assumption that 


N 'Em /(m) 


K 


—> 00 , 


inf sup E[r(z,z)\ > exp ( —(1 + e,v) 
2 zee¥ L \ 


1 v£„-r (m) 


K 


(9.21) 


for some sequence €n = o(l). Moreover, if ^^7 — = 0(1), then infs sup 0 M/. E[r(z, z)\ > c 
for some constant c > 0. 

Proof. Since Qf IL C ©q /Z/ , the minimax risk of ©g /L is lower bounded by the minimax 
risk of ©JE- Due to the fact that Bayes risk lower bounds the global risk, we also have 
inf~sup 2g0 ML E[r(z,z)\ > inf~ sup-, g0 ML B r (z). Hence we have from Lemma 5, 

inf sup E[r(z,z)\>mt sup E[r(z,z)\> inf sup B r (z) = inf sup B T (z\). 

z * ze&jf L 2 zee^ IL z zeBf IL 


S 

Now we need to obtain the minimax lower bound for the larger parameter space © Mi in 
the next lemma which concludes the proof for lower bound. 
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Lemma 7. (Lower bound) Under the assumption that 


— y oo, 


N y j(m) 

K 


inf sup E[r(z, z)\ > 

Z qML 


exp(-(l + e*)^%^) 

exp(-(l + e ,)^%^) 


K = 2 
K > 3 


(9.22) 


for some sequence e jv = o(l) and some s > 0. Moreover, if — — = 0(1), then 

infj sup 0 Mi, E[r(z, 5)] > c for some constant c > 0. 

Proof. By the argument of Zhang and Zhou (2015), for K = 2, ©q /l is the least favorable 
case for Q AIL . Hence we can keep the same lower bound for Q AIL (obviously the lower 
bound holds since ©q /l is a subspace of <d AIL ). However for K > 3, this is not the case 
and we can improve the lower bound. The least favorable case consists of the case where 
at least a constant proportion of communities are of the size Define Qf IL to contain 
all z G Q AIL such that a constant proportion of communities have size , and another 
constant proportion of communities have size |~j^~| and all other communities are much larger 
in size. Then using identical arguments as Lemmas 4 and 5 we have, 


inf sup E[r(z, z)\ > inf sup B r (z\) 

* z£@ ml 


z&e™ L 


[jv j J 

> e p(^ c (m) f; xi m) > f; y/ m) ) 


2=1 


2=1 


> exp(—(1 + cat) 


NY /( m ), 

Z—jm 


sK 


Combining these two cases we have the result for the entire parameter space <d AIL . 


□ 


Upper bound 


To prove the upper bound, we develop a penalized likelihood type algorithm similar to Zhang 
and Zhou (2015) and show that its risk is upper bounded by the lower bound obtained in the 
previous step. We note that in the homogeneous MLSBM case (©q /l and 0f /L ), i.e., when 
all the intra-community connection probabilities are af' > /N and all the inter-community 
connection probabilities are /N for layer m, the log likelihood function is 


{ n (m) 

i<j 


Am) 


Zi = Zj} + log(l - ^—) J^(l - A^lizi = Zj} 


N 


i<j 


iJjn) _ h(m) 

+ lo s(^r) A< iT )l { Zi ± + 1 °s ( 1 - -rr) ~ ± Z A 


N ' ^ lJ 1 ‘ ' JJ N 

Kj i<j 


The maximum likelihood estimator z MLE is given by, 

z MLE = arg max T(z), 


(9.23) 
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where T(z) is given by 


T(z) = ]T | log 


a (m)( 1 _ 6 (m)/ iV ) 


J2 = - log 


i<j 


6M(1 - a ( m )/iV) 

= 2J } - fcf’-'in, = % }}, 


1 - ftM/jv 
1 - a( m )/iV 


1{^» 


(9.24) 


with C ( m ) > 0 is defined in Lemma 5 and /4 m ) = log ■ However in general the 

parameter space will not be homogeneous. Under the more general parameter space 0 Mi , we 
still define an identical form of the penalized likelihood estimator as z MLE . Let z be the true 
class assignment and z G 0q /l be an arbitrary class assignment satisfying r(z,z) = R/N, 
where 0 < R < N is a positive integer. Then note that 

T(z) - T(z) = = z j ] 

m m 

- Q2 k^l{zi = Zj} - ^ k( m h{zi = Zj}) 

m m 

= e tOM)} - 5><’">4 m) i{( s ,j) 6 «(!,*)}) 

m m 

- ^4 (m) (| (i,j) G 7(5,5)| - 1(4 j) G ar(M)l), (9.25) 


where a(4,z) = {(4 j) :i < j,z t = z j} Zi ^ %} and j(z,z) = {( i,j ) : i < j,z t ^ Zj,Zi = %}. 
Henceforth we will use shorthands a and 7 respectively to denote the sets. 

Let Pr = P{z G 0^ L : r(z,z) = R/N,T(z ) > T{z)). We want to bound P m which is 
the probability that an arbitrary class assignment 4 which does not agree with the truth z 
in exactly R places (after permutations) can maximize T(z), i.e., P(T(z) > T(z)). We start 
with the following lemma. 

Lemma 8. Let z be an arbitrary class assignment satisfying r(z, z) = R/N , where 0 < R < 
N is a positive integer. Then there exists a sequence e —> 0, independent of z, such that 


P(T(z) > T(z)) < 



I\ 

2 ( 1-6 )NRJ2 m I^) 
9 K 


+ R2 Em /M ) > 


ifR<£c, 

ifR>0 


Proof Let U^ = {uj m) ~ Hern(p[ m) )}, V (m) = {V) (m) ~ Hern(gJ m) )}, X (m) = {xj m) ~ 
Bern^q^)} and ld m ) = {Y^ r\j Bern(p ( m ))} are sets of independent Bernoulli random 

variables for arbitrary l. Further let min pO 4 p^ and max qj m} < q ^ m 4 Then we can 

( ^ ( m ) ( ^ ( m ) 

define two sets of random variables {A/" ~ Bern{^)} and { B ~ 5ern (^r)} inde¬ 
pendent of U and V. Now we define i.i.d copies {Ad m )'} of and {yM'} of 

and 


as Y, 


(m)' 


= and V/ m) = X/ mj B\ m >. Clearly, Y/ m) = U^’A\ m> < U t 


r(m) 


d m )' n(m) 


(m)' 


r(m) a (m) 


T (m) 
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y( m ) _ y(, m Yg( m ) < 

constants {c/™-*}, 


Hence we have for any real number s and sequence of positive 


M HI 

if s + ^ C M £ [/M < J] C M £ 

m 1=1 m 1=1 

HI HI 


then s + J2 e (m) ^ ^ - Y1 cM J2 X - 


(m) 


m 1=1 


m Z=1 


(9.26) 


Now we replace and with A^l{(i,j) e a(£, z)} and G 7 ( 5 , 2 )} 

respectively, pf n ' 1 and qj™'* with 7 ii™]/N and 1 t^JN respectively (recall was previously 
defined in the main article as the matrix of block connection probabilities in the MLSBM’s 
mth layer) , p^ and q^ with a^/N and b^/N respectively and s with E m — M). 

Then we get using the result in Equation (9.26) and Equation (9.25), 


HI HI 

P(T(z) > T(z)) < P | ]T c<”> E x[ m) - V e<">> V lf>> > 5] *: (m »(l7l - H) 

n 1=1 m 1=1 m 


( ( HI HI \ / 

= p exp i E c (m) E A '. (m) - * E c<m) E y f° > exp It E ^ (m) (l 7 l - W) 

\ \ m 1=1 m 1=1 ] \ m 

< exp ^--t /H r,,) ( 7 * — [oj])^ (^E[e t ^‘ rnC(m)x ^ ']j ^ E[e ~ t ^ imC< ' m)y * , 


where the last inequality follows from Markov inequality. Now we choose t = t* = 1/2. Then 
we have 


E\e 


i*E m cWjW, _ 


n 


1 _ /n \ 1/2 
(m) /N J 


r a (m)b(m)y/2 


1 - E m >/N J \ N 
exp(^fc( m )/2)exp(-^/( m V2) 


+ (1 


2 (m) 

Jv 


) 1/2 (1 



and £e-^-" c(m)y ‘ m) = exp(- E & (m) /2) exp(- E m /(m) / 2 )- Consequently, we have 

P(T(5) > T(z)) < e'^^ E - /(m) . (9.27) 


A lower bound on the size of the sets a and 7 was given in Lemma 5.3 of Zhang and Zhou 
(2015). We use the results directly here : for an arbitrary assignment z G @q /l satisfying 
r(z, z ) = R/N, where 0 < R < N is a positive integer, we have 


min(|a(5, z)\, \i{z,z)\) > 


(l-e)NR 

K 

2{l-e)NR 
9 K 



2 


if R < 
if R > 


(9.28) 


Using this lower bound for both |a| and |y| immediately yields the result. 


0 
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Let T(s) denotes an equivalent class for z consisting of all permutations of z. In order 
to use an union bound for Pr, we need to count the cardinality of the set of Ts which have 
distance R from z. Next we use Proposition 5.2 in Zhang and Zhou (2015) which states that 


|{r : 35 G T s.t r{z,z) = R/N}\ < mm{(——) R ,K N }, 

IX 

to conclude through a union bound that, 

P R := {3 z G s.t r(z,z) = R/N,T(z ) > T(z)} 

< |{r : 3z e r s.t r(z, z) = R/N}\ max P(T(z ) > T(z)) 

z,r(z,z)=R/N 


The next result uses the above results to establish the upper bound. 

N V'' /( m ) 

Lemma 9. (Upper bound) Under the assumption that K —$■ oo, for the penalized 

maximum likelihood estimator z defined in Equation (9.24), we have 


sup E[r(z,z)\ < 

z£0 ml 


exp (—(1 + e N ) 
exp (—(1 + e N ) 


2 h 


sK 


K = 2, 
K > 3, 


(9.29) 


for some sequence e n = o(l) and s G [1,5/v^3] ■ 

Proof. The proof technique is similar to Zhang and Zhou (2015); we only modify the proof 
in places to suit our objective while keeping the approach the same. We first prove the result 
for the subspace ©q /l and then extend it for @ A/L . We first consider the case K — y oo, 

break the assumption — > oo into 3 parts and verify that in each case E[r(z,z)\ 

is bounded by a term of the form exp(—(1 + o(l)) A —). Let 7 ] = o(l) be a universal 
sequence independent of N that converges to 0. We note that 


N 

NE[r(z,z)} < J2 RP R- 

R= 1 


jYy' /( m ) _ 2n)NT /( m ) 

(1) If lim inf n^-oo k^n > C th ere exists a small constant e > 0 such that -- ki^n - > 

1 + e. Let 77 decay slowly such that both ^ — and ttE go to infinity. Let B = 

lVexp(—(1 - 3r/)NJ2 m I {m) / K )- Clearly, P, = eNK exp(-(M^ - l)E m /(m) ) < B - 
This follows by replacing both log(eA') and 'Yh rn Rm ' > by a bigger term, r]N I^ 171 '/ K. 
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We will show that E[r(z, z MLE )] is bounded by 0(B/N). First let R G [2, Then, 


P R < 


eNK 


(1 -v)NE m i {m) 


— exp |^- K - 

eNK ( (1 -v)NJ2 m I (m) 
exp-,,- 


+ R^I (m) )) 


eNK 


(1 -v)NE m i {m) , (p , R 

- K - + {R+ R ^T 


E' 1 


< N exp I —(1 — rj)N ^ I^ m) /K + log(efl) 


(iV exp + 2 (g £ 1 {m> + log(eif) ) ) 

f/Vexp (-(! -2,,)JV^/<"•>///)') (Tex p (-(1 + t) log N + 


The penultimate step follows by replacing 1 — 2r) by 8/9 and the last step follows since 
e/4 — 3e 2 /4 > e /6 for large N and small r) and e respectively. Hence 

eN/3K oo yye/6 

NE[r(z, z)] = P, + Y, rp R < U + £ = P, + fl /6 _ = 0(B). 

P—O P—o V / 


(9.30) 

The inhnite sum in the last step can be obtained by differentiation the infinite series sum 
J//r =| At-d-R )/ 6 w ifh respect to At. 

Next we show that the same conclusion holds for R G [fyv N]. First, note that for any 
^ > R > §f, we have - R^2 m I {m) . Hence, 


Pr < 


eNK 2(1 -r,)NZ m lW \« 

eN/3K 9 K ) 


( , (1 — 27?) AT Y /( m ) NT J (m) ,3eK 2 

< e xp(--dW +log( — 


3e/l 2 . 2(1 - 77 ) At V I <m ) \ R 9 

— exp(- 


< -P (-(1 - ^ exp 

<Bexp(- 


< £exp(--(l + e) logAt )^- 9 

9 

< B i\T- 2 ( 1 +d(R-9)/9 < gjy-2(R-9)/9 
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By the same reasoning as above, J2^ =eN / 3K RPr < S/i=i RPr is o(B). Hence combining 
this result with Equation (9.30), NE[r(z,z)\ = 0(B) 

For the remaining two cases, (2) limsup '* < 1 and (3) = 1 + o(l), the 

proof follows from the corresponding cases in Zhang and Zhou (2015) ( Proof of Theorem 
3.2). Hence we omit the details and only write the results. 

N V'' /( m ) 

( 2 ) If lim supjv^po KiogN < then there exists a small constant e > 0 such that 

> 1 - e. Define R 0 = N ex p(-(l - K~ e / 2 )( 1 - V )N I {m) /K) and R! = 
N/K 1+e . We have, 


Pr< 


(^T ex P(— (1 ~ r,)JV ^ m/(m> + R’Em l{m) )) R < exp(- (1 -^ + ^ J(m) ) R 0 < R < R! 


(^exp(-^#^))« < exp(-^%^) 


R’ < R< N 


and hence from the proof in Zhang and Zhou (2015) E[r(z,z)\ = exp(— ^ —-). 

n y' /( m ) 

(3) If KiogN = 1 + °(1), then there exists a positive sequence w = o(l) such that 

< w. defining R 0 = N ex p(—(1 — w)N'^2 m I ( - m ' ) /K) and 


, NY i 


— 1| <C w and 


I K log N 

R! = w 2 N/K we have, 


yiog n 


Pr< 


(^F ex p (— (1 ri)N ^ I(m) + R'T, m lim) )) R < exp ( w(1 r,)NR p mIim) ) Ro<R<R' 
(g^f exp(- 2(1 -" )P ).J? mI(m> )) R < exp(— NR Em Iim) ) R' < R < N 


and hence from the proof in Zhang and Zhou (2015) E(r(z,z)\ = exp(— — —1). 

The proof for hnite K is similar and hence omitted. 

Now we prove the upper bound result for the entire parameter space @ Mi . The proof 
for the case K > 3 is similar to the proof for ©q /l with the result in (9.28) being replaced 
by Lemma A.l. of Zhang and Zhou (2015). However, for K = 2, we proceed as in Section 
A.2. of Zhang and Zhou (2015) and assume without loss of generality that R = |_yj • Let 
r(z,z ) = R/N and define the sets a and 7 as before. Note that R < N/2 since distance 
between the two class assignments d(z,z ) = imn(d H (z, z), N — dn(z,z)). We also have 
M + ItI = R(N — R) if r(z, z) = R/N (Zhang and Zhou 2015). Hence from Equation (9.27) 
we have, 


P(T(z) > T(z)) < exp 


(9.31) 


The proof is similar to the one for @g /L and we only specify the specific results here 
omitting the technicalities. Let 0 < e < 1/8 and recall that our assumption for K = 2 case 

N V /( m ) 

is that -y; - > 00 . We have the following 3 cases in parallel to the 3 cases earlier, 

(1) If > (1 + e), defining B = Nexp(—(N — 1) X] m -^ m V2) we have Pi < B. 

The for 1 < R < eN/2 we have, 

< (eNexp(-(l - e/2)(l + e) log N)) R < BN~ eR/ \ 
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and for eiV/2 < R < N/ 2 we have, 


Pr< 


,2eN. 
'~eN' 


R 


exp(- 


NR Em 


< B exp(- 


N(R - 4) Em / <ra) 


and hence 5)] = (1 + o(l))B/N. 

(2) If — < (1 — e), defining R 0 = iVexp(—(1 — e~ eN ^ ml{m) P)N 1^/2) and 

R' = Nexp(-NJ2 m / (m) /8) we have, 





R(N-R')j: m l(™^ < exp ( C- £jV J(m) /2 ^ J(m) ) 


R 0 < R< R' 
Rf <R< N/2 


and hence E[r(z,z)\ = (1 + o(l))R 0 /N. 


(3) If 


21ogiV 

— 1| <C w and 


I 2 log N 

R! = w 2 N we have, 


= 1 + o( 1), then there exists a positive sequence w = o(l) such that 
< w. Defining R 0 = N ex p(—(1 — w)N Yhm ^ m V2) and 


y/\ogN 


Pr< 



R(N-R') J2 m I (m) ^ < exp( - wNR X2,„ P m) \ 

< exp (— m«- 4) £,/-> ) 


R 0 < R< R' 
R’ <R< N/2 


and hence i?[r(;z, £)] = (1 + o(l))i?o/IV. 


□ 
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